Yes, it is my pleasure to introduce Sam Chang. He is a distinguished professor of mathematics and statistics at San Diego State University, and he also holds a visiting research appointment as a mathematician at the Scripps Institution of Oceanography. Sam has held many visiting positions, which impressed me: in Canada, at the NASA Goddard Space Flight Center, and in Tokyo. He has filed patents, which is something we often don't do in the atmospheric sciences, though it's quite common in other scientific and technical departments. He has published three books and over 100 research papers in top journals. Sam, I'm very much looking forward to your talk.

Well, thank you, Judith, for your introduction. I'm not the only mathematician here; we have quite a few good mathematicians, like Anish himself and Matt Chidon, and I learned a lot from Anish when he was at Scripps. Anish asked me to talk about some statistics tools and uncertainty; that's the title of my lecture. So I would like to discuss the following. I put "big data" in front of "statistics" just to modernize myself. I will select three topics: first, data delivery and visualization, with code and reproducible results; then correlation and regression; and then, for uncertainty, error estimation and some technical issues in data assimilation.

I'd like to start with Chidon's slides; I really liked them. Chidon said a few key things: a good framework, he said, must be mathematically expressed. I like that, because the hand-waving time is gone. Our field has become a quantitative field. Our results are no longer just "this pattern looks like that pattern"; they should be testable against observations using skill scores. The predictions should be quantitative, which means they are all supported by data.
My first topic is what we call four-dimensional visual delivery of big climate data, or 4DVD. A DVD is a machine to play music, a disk; DVD is hardware, but 4DVD is software. The interface looks like this, and I'm going to show a couple of examples and give you a very quick demo. For instance, you click on a spot, say the San Diego area, in the 20th Century Reanalysis, a model product. You click on that spot and you get a time series, and you can get time series at different levels; the 20th Century Reanalysis data has 24 levels altogether, and the time series goes all the way back to 1851. Then you can download the data, compute the trend, and do all kinds of statistical calculations with it.

You can also use it for dynamic analysis. For instance, you can look at the V wind, also from this 20th Century Reanalysis example. You can see the ITCZ and the Hadley cells, but when you see the Hadley cell, you realize the real one is not the same as what you see in textbooks, because the textbook figures are hand-drawn schematic diagrams. Those are nice, but the actual circulation, even as a long-time average, is not as clean; it varies from place to place.

If you have a computer in front of you, you can just go to www.4dvd.org and you will get this globe. You can rotate it to look at different spots and different areas, and zoom in and out; you can do all kinds of things.
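The trend calculation mentioned above is a few lines once the time series is downloaded. Here is a minimal sketch in Python; the series below is synthetic stand-in data, not actual 4DVD output (a real series would be read from the downloaded file):

```python
import numpy as np

# Hypothetical stand-in for a monthly/annual temperature-anomaly series
# downloaded from 4DVD for one grid point, 1851 onward.
rng = np.random.default_rng(42)
years = np.arange(1851, 2015) + 0.5                      # nominal annual time axis
anomalies = 0.005 * (years - 1851) + rng.normal(0.0, 0.2, years.size)

# Least-squares linear trend via a degree-1 polynomial fit
slope, intercept = np.polyfit(years, anomalies, 1)
print(f"trend: {slope * 100:.2f} degC per century")
```

The same two lines work on any series the tool exports, whether you open it in Python, R, or Excel.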
If you click someplace like San Diego or Colorado, you get that spot and its time series; then you can choose different height levels, say 500 hPa, or higher up at 200 hPa, and do some calculations. For instance, there is a statistics summary: minimum, the 25th percentile, the median, the 75th percentile, standard deviation, variance, skewness, kurtosis, et cetera. For the general public, some people may want to look at the general climatology: the green curve is the climatology, the red one is the historical high, and the blue one is the historical low. This is very good for classroom teaching. Then, if you like the data, you can download it, and the data comes to you right away, instantly, with no waiting time. Students can then play with the data using R, Python, or even Excel: histograms and so on. And if you like the map, the map can take different shapes: you can change it to a regular two-dimensional map, use all kinds of colors, and add bottom topography, mountains, rivers, and lakes through the options. And if you like this map's data, you can just download the data for the map as well.
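The statistics summary panel described above can be reproduced in a few lines of NumPy and SciPy. This is a sketch on synthetic stand-in data; the numbers are not from 4DVD:

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for a downloaded temperature series (degC)
rng = np.random.default_rng(0)
series = rng.normal(loc=15.0, scale=2.0, size=1000)

# The same quantities 4DVD's statistics summary shows
summary = {
    "min": np.min(series),
    "p25": np.percentile(series, 25),
    "median": np.median(series),
    "p75": np.percentile(series, 75),
    "max": np.max(series),
    "std": np.std(series, ddof=1),
    "variance": np.var(series, ddof=1),
    "skewness": stats.skew(series),
    "kurtosis": stats.kurtosis(series),  # excess kurtosis: 0 for a normal
}
for name, value in summary.items():
    print(f"{name:>9s}: {value:8.3f}")
```

For a near-normal series like this one, skewness and excess kurtosis both come out close to zero, which is exactly the kind of sanity check students can do on real downloads.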
So you get just what you like and download it directly, instead of downloading several hundred gigabytes of the reanalysis dataset; you get exactly what you want right away. How can it be done so fast? Because of three technologies: database optimization, distributed computing, and fast queries into the system. With that design, much of what you have seen runs on your computer, not on the server, which makes it really fast.

Let's get back to the PowerPoint slides. This looks like a pretty useful tool for data visualization and delivery, and we are continuing to develop it. Several people have worked on this for many years: we started in 2012, first released it in 2016, and currently have about four people working on it. We want to beautify it further, for example with a day mode and night mode, like Nullschool, which many of you have seen; it is so beautiful. What's the difference between ours and other products? We have data manipulation functions; it is actually a data machine. A quicker version would let you say, "I have my own data, and I want to quickly use this as a tool to produce a beautiful figure." We want you to be able to use our machine on your own data, instead of us mounting the data for you. We also want more map projections; we have two kinds of selection now, some 300 individual selections, and we're going to add more. And we plan cross-sectional maps for diagnostics, and more space-time maps to show wave propagation.
This will be useful to us when we look at the data, and we want collaboration with you. One collaboration model is the simple one: you have data, and we visualize it for you; for instance, if you have one of the S2S models, we put it on for you. Another collaboration model is that we customize 4DVD for your dataset. This is the example we did with Donata. Donata, thank you; your presentation in this lecture series talked about it. This is ArgoVis. It started in 2017, when Donata and I put together a statistics workshop at SIO, and we found there was a match. So we built the ArgoVis machine. That was a master's thesis by Tyler Tucker, who was in my lab; he was interested in Donata's data, and they produced this wonderful tool for ocean scientists. Donata, you can tell us more. We were also talking about getting school kids to adopt ArgoVis: a kid can say "this is my float," track where it is, and see what the data look like. A kid can start their own social media, Twitter and that kind of thing, and bring the science to the general community, to the general public.

There are many other products out there in the world; data visualization is a big deal, and many people are getting into it. Nullschool is perhaps the one best known to people. It is beautiful: when you look at it, you can see the blocking right there, so nice. But Nullschool cannot do data service. They have their kind of strength, and we have ours: our focus is on data delivery and then data visualization. We actually want users to play with the data.
One of our main audiences is school kids, for classroom use. There are many other tools, like Google Earth Engine, et cetera, all good stuff. And there is one right here in Colorado; that one is more for scientists, and a high school kid probably cannot play with it. Each one has its own use.

The other thing is that we integrated 4DVD with our book, Climate Mathematics. You have this time series data or this map data, and you say, "I want to play with it; I want my own customization." So we provide a list of standard data analysis tools in the book, with both Python code and R code, for students to use as a reference manual for their labs. These are the resources from the book: you can get the computer code, there are YouTube tutorials for R and Python, and you can download the datasets, NetCDF and all kinds of formats. Every figure in the book can be reproduced by a user instantly, in the classroom or when students are doing homework. We think this provides a good way to train climate science students with useful mathematics. The way we learned mathematics in college, our math professors told us math is so beautiful, and that math is useful, but the professor would never give you a useful mathematics example. For instance, you never learned how to compute EOFs from your math professor, but that is probably the first thing you will be asked to do when you go to work in a professor's lab. And professors also get frustrated when students come in.
They feel they have an obligation to have a senior student tell the junior students how to use this software and that software, how to plot this data, et cetera. What we try to do is train students at the undergraduate stage in linear algebra, statistics, calculus, et cetera, together with coding, so that students will be ready to do research in their senior year, at the master's level, or at the entry PhD level, or to go to work; they can use this to get a good job.

Another part is outreach. This summer, just last month, we were training local school teachers to use 4DVD and R. The kids are very much interested in climate change. To get them playing with data, it must be easy and fast, and also fun and interesting, so they can enjoy it. If you go to NOAA's Climate Data Online you can find the data, but a kid may spend five minutes and get stuck on it. The teachers really liked this; a lot of the figures from the book were reproduced by them.

Now let's jump from the tools to uncertainty. One example is called spectral optimal averaging. The question is posed this way: you have stations at different spots, you compute the regional average, and then you ask how good it is; what is the error? It depends on three things. One is location: which spots, their geometry. Another is how good your observational data are: the instruments, good or bad. The third is the relevant climate dynamics: the ocean will be different from land, and desert will be different from forest areas.
So how can we have a formulation that incorporates all three together? In objective analysis as we learned it, you often incorporate the first one, and sometimes the second. And we found that in objective analysis, the error analysis part is often not very useful: the interpolated result tends to be quite useful, but the error part is basically useless. So you often see publications that use objective analysis but do not use the error estimate from the objective analysis mathematics; rather, they use the error from cross validation. They use one theory to do the analysis and a different theory to do the error assessment.

There are many publications on this. I learned this method from Jerry North, and I learned a lot from Tom Smith, who was also at NOAA. We did quite a few papers in the early 1990s. I started this with Jerry, working on the TRMM satellite observational error. Then I worked at CPC with Tom Smith and colleagues, including Chet Ropelewski, trying to answer a question for the WMO: if we have only fifty-some stations around the globe, where should we put them? This is the optimal observing network you have heard about. Another application is global warming assessment: what are the error bounds? In the 2001 IPCC report, the error bounds were quantified by this method.

The mathematics looks like a headache, but it incorporates all three things. One is the locations: the station locations are the r_i. And the psi's are the empirical orthogonal functions.
The lambdas are the eigenvalues of those empirical orthogonal function modes. And then there is the data error part: the data themselves have errors. So you have the station locations, the data errors, and the climate dynamics; all three things put together go into this formula. What is the main principle? The principle is really a minimization of the mean square error, like regression.

You can extend this concept to other questions. Suppose you have scattered observations and you want to grid your data onto grid boxes; what will the error be on each grid box? This becomes the error mapping problem: you produce a reconstructed dataset, and at the same time you should also produce an error map to go with it. The same applies when you do forecasts; we have done some SST prediction with Tom Smith, and we want to produce the error map for it.

That theory was established in 2001, when I was a visitor at Goddard Space Flight Center with Bill Lau's group; Bill directed me to work on this. I remember when I went there in 1999 as a visiting scientist, a visiting scholar, he told me: you are probably good at math, but I want you to work on something useful. Just like this morning's talk said: those computer scientists may be smart, but they work on different scales, different signals. I learned a lot from Bill on this.

And this slide is really interesting to me; it is about how we can misunderstand and misuse correlation. If you just look at correlation or standard deviation, that's not enough. We often have these problems; many published papers have this kind of error and are simply wrong. What happened is not because the regression is wrong.
It's not because the correlation is wrong; it's because we have not checked the assumptions. What are these assumptions? There are four assumptions in linear regression, which we all saw when we took a statistics course. First, linearity: y really is a linear model in x; everybody agrees with that, otherwise we wouldn't do it. Second, the model errors need to be normally distributed, and we need to test that. Third, the residuals should not vary too much: roughly constant variance with respect to x. Fourth, independence of the errors: the residuals should not have serial correlation, and there is something called the Durbin-Watson test for this. Number four is the error people most often make in publications: they count the degrees of freedom naively, but because of serial correlation the effective degrees of freedom are much fewer, and the error bars become much bigger. There is nothing new here; it is all in your textbooks, except that the textbooks are hard to understand: so much mathematics, and they don't talk about the physics. So in our Climate Mathematics book, we say that every important mathematical formula must have a climate science example, and every important climate science description must have a mathematical description.

"Sam, you have about five minutes." Yeah, thank you for reminding me. So the next example is from Sergey's presentation. We all know data assimilation; this formulation is familiar to everyone here. Some of the minus signs and equal signs were lost when his slides were copied, so I cannot reproduce them exactly, but in there are the various matrices, et cetera.
The key part is here. He pointed out that the R matrix is the observational error and the P matrix is the forecast error. Those two matrices are not known; you need to assume them or compute them. In data assimilation, people often don't tell you this; they just say "these are my data assimilation results" and leave it out. Sergey was really good and told you the tricks inside: what is the uncertainty, and how sensitive is the result to these matrices?

I also liked the lecture by Haim Kim on predictability and what it is. Of course, many people talk about it; there was a huge discussion, a huge debate. So I looked at what the dictionaries say. The first definition is not what we want; there is a second definition, from Wikipedia, that is closer to what we want. But how do we quantify it? There are skill scores, hit scores, RMSE, and cross validation.

A very important part is reproducible results: we need to be able to reproduce our results. Some journals are in this trend now. One example is this paper, which shares all the data and the code so the reader can reproduce everything very easily. And the journal ESSD requires you to do that; we recently put a paper there. Another way to help people understand a paper better, particularly with machine learning nowadays, is to include a flow chart; a flow chart is really good for helping people understand your work. NCAR has a big flow chart for its programs, and each product has a sub flow chart. That makes your results reproducible.
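For reference, the assimilation formulation being discussed has a standard form. This is a generic sketch in my own notation under the usual Kalman-filter assumptions, not a reproduction of Sergey's slides. With background (forecast) state $x_b$, observations $y$, observation operator $H$, forecast error covariance $P$, and observational error covariance $R$, the analysis is

```latex
% Analysis update: the gain K blends forecast and observations
% according to the assumed error covariances P and R.
x_a = x_b + K\,(y - H x_b),
\qquad
K = P H^{\mathsf{T}} \left( H P H^{\mathsf{T}} + R \right)^{-1},
```

with analysis error covariance $P_a = (I - K H)\,P$. The sensitivity point above is visible directly in $K$: assuming $R$ too large down-weights the observations, while assuming $P$ too large makes the analysis follow the observations too closely.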
So, summary and take-home messages, four things: why 4DVD, why the book, what to do about correlation (always check the assumptions, always plot the scatter diagram), and reproducible results. I suggest that we all produce open-source code and data so that others can reproduce our results. Thank you very much for your attention.

"Thanks very much. I was especially pleased that you talked about how important serial correlations are. One comment, related to the plot you showed: in the stochastic differential equation community, there is a big emphasis on getting not only the first- and second-order statistics right but also the lagged time correlations, and a lot of these misunderstandings go away once you look at at least one time lag, ideally more than one. That also gives you a lot of insight about linearity and nonlinearity. My question for you, Sam: given that this is the S2S science summer colloquium, in which area do you think the mathematical sciences, mathematical approaches, will be most beneficial in advancing our understanding of S2S science?"

Well, I think this talk actually answers your question: it is to explore the inside of machine learning. Our group is doing that; I was going to present some machine learning results from the deep ocean. You really want to find the physics to support what you want to get, then find the proper tool, and then you need to understand that tool. We often think machine learning is a black box, but it's not. It is actually built upon logistic regression, which is like our mind: you say this is red, this is black; there is a threshold.
So there is that kind of mathematics behind it, and it should be explicitly researched, step by step; maybe work through an example step by step, with a mathematical criterion at each step. That is one of the most important parts of our work. Thanks so much.