I'm Freya Olsson, I'm a postdoc at Virginia Tech. I work with Quinn and Cayelan Carey here, working on a range of different forecasting projects, and I've been leading on the aquatics theme of the forecasting challenge in the last year or so. I have a background in lake ecology, which might pop up in the next hour or so as you see what we're gonna be doing. Quinn, do you wanna introduce yourself? Yes, I'm Quinn Thomas, associate professor at Virginia Tech and the lead of the Ecological Forecasting Initiative Research Coordination Network, which is a National Science Foundation funded project to grow the field of ecological forecasting by harnessing NEON data. And as part of that, we designed and are hosting the NEON Ecological Forecasting Challenge, which challenges you, or anybody, to submit your forecast of NEON data before the data are collected. And that's the foundation for the workshop we have here. All right, hopefully you can see the presentation. Yep, that's great. Okay, so can you predict the future? That's the eternal question, right? Today I'm just gonna give you an overview of what we're gonna cover. I'm gonna start off with some key concepts of ecological forecasting, make sure we're all on the same page. We're gonna introduce a bit more about NEON — probably you all know what NEON is by now — and the forecasting challenge. And then we're gonna switch over to a walkthrough of a forecast workflow that we've developed as a tool and a framework to get you forecasting really quickly, and give you the tools to enable you to submit your own forecasts. And then towards the end of the workshop, I'm gonna point you to a few more resources to find out more information about the challenge — if you're really interested in getting involved, how you can do that, and some resources to make that easier. As I said, we'll start off with an introductory presentation. I'm gonna go through the R setup that was emailed out to you, just in case anyone was having issues with that. And then we'll get on with the hands-on coding, or just feel free to follow along if you don't wanna do the coding specifically. And then there'll be time for questions and time for you to have a go working with the code and modifying the model that we've given you. Feel free to interrupt with questions. I'm gonna stop at specific points for you to ask questions, but if you have a question about anything that I'm talking about or any code that I'm presenting, please feel free to interrupt by raising your hand or just shouting out. So why do we forecast? Many of us will have looked at the weather this morning. I know I did. It suddenly turned very cold in Blacksburg this morning, and I wanted to know whether it was gonna rain and whether I needed to bring my umbrella. So small daily decisions like that are dependent on forecasts, but so are much larger decisions, like how to manage extreme weather events — or, more recently, the large air quality issues we had in the summer coming from the wildfires in Canada — and forecasts of those types of things can help us make decisions about how to manage our lives and our ecosystems. And ecological forecasts, just like meteorological and air quality forecasts, can help us make those informed decisions. Today we're gonna be talking about near-term iterative ecological forecasts, and just to break that down, we're talking about forecasts that are generated on the subdaily to decadal timescale.
So things that are management relevant — we can make decisions on a timescale that we can then evaluate and update, making increasingly improved decisions, which brings us to the iterative part: this process of repeatedly validating, updating initial conditions or model parameters, and issuing new forecasts every time data become available. So we're constantly collecting data and we can iteratively improve our models. Ecological forecasts are future predictions of physical, chemical or biological variables and, importantly, with quantified uncertainty. The future is inherently uncertain, and quantifying how uncertain we are about those future conditions is really important. To give you some examples of what I'm talking about: you might have a forecast of dissolved oxygen concentration for the next one to 48 hours if you're planning to stock a river with fish, for example — you don't want them all suddenly dying if the oxygen concentration is gonna rapidly drop. You might produce a forecast of the percent chance of leaf fall to estimate when the optimal time is to go leaf peeping. I can say it was probably about two weeks ago in Virginia, but potentially three months ago it would have been useful if you wanted to plan a trip. Or if you're planning on going hiking, you wanna know what the potential tick abundance is gonna be, so that you can estimate the likelihood of interaction with these species. What this might look like graphically: we're constantly collecting observations of these ecological variables, and these can be shown on a time series like this. So these are past observations, and then we're gonna use these observations to build a model and make a prediction of the future. Now you can see here this future prediction — going out seven days in this case, or seven hours, or whatever the time step is — has this band of uncertainty around it. We're quantifying the uncertainty, and that's really important when we think about ecological forecasts. Ecological forecasting is still an emerging field and is rapidly growing. You can see the number of forecasts being produced has increased exponentially in the last 20 years, and we're continuing to develop this field. It has lots of potential for both fundamental and applied questions when we think about ecology, but there's still lots to learn about predictability and the best methods for this field. Also, forecasting is challenging. The endeavor of producing a real-time forecast is not an easy one, given the infrastructure needed to collect data, disseminate data, develop and deploy models, and evaluate and distribute our forecasts to end users — and there's the need to do all of this in a repeatable and reproducible way, so that we can continue to produce new forecasts in this iterative cycle. In other fields like economics and computer science, competitions and challenges have been used as a way to push a field forward, and we've developed this challenge as a way to catalyze progress in ecological forecasting. We see this as an organizing principle for the community to get behind: developing a community of forecasting with common standards, with the ability to develop tools and infrastructure that can be common across the field — and a challenge helps us develop a platform to start to do these things. It also helps us answer questions of predictability.
If we can get lots of people forecasting the same data using lots of different models at lots of different sites, we can start to answer questions about how predictability varies across different scales of models, sites and variables — and NEON data is the perfect opportunity for this. And so the NEON forecasting challenge was born. It's been going now for two full years, with the goal of the Research Coordination Network to lower the barrier, build community and infrastructure, and develop this platform for ecological forecasting. It's also helping NEON achieve its mission. Using data collected from across the 81 sites for forecasting is achieving the mission of NEON, which was initially set out in the early 2000s and includes forecasting as one of its key aims. So what is the challenge? As I said, and as Quinn said in his introduction, this is a platform for the community to make predictions of conditions at NEON sites before the data are collected. So this is a real-time forecasting challenge of the actual future. We don't hold data back. We don't have any competitive advantage by being the organizers. This is real future data that we're trying to predict. The data come from all 81 NEON sites, both terrestrial and aquatic, and we'll talk about that in a sec. It covers five different themes across all levels of ecology: from ticks and beetles, thinking about communities and populations, to leaf phenology, thinking about these phenological processes, as well as covering terrestrial and aquatic sites, from our terrestrial fluxes theme through to aquatic water quality, which we're gonna be looking at today. Just to reiterate, a forecast is a prediction of these future environmental conditions that includes quantified uncertainty. As an overview of how the challenge functions: we are using data from our ecological networks — NEON, in this case. We're also collating numerical weather forecasts. A lot of the models that we use to make predictions include weather covariates as some of the driving variables, and therefore having a common weather forecast that teams can use gives a kind of consistency across teams. These are all made available to our forecasting teams, who produce the forecasts and submit them to the forecast catalog, where they are automatically evaluated every time new data become available. These scores are made available for teams to look at, both on our dashboard and for their own analyses. Another key part of the challenge is the training, such as what I'm doing today, and the many templates that have been developed to really help people get started with their forecasting and get involved in the challenge. If you're interested in learning more in depth about the challenge and how it was developed, you can look at Quinn's recent paper in Frontiers in Ecology and the Environment. And there's a link in the chat. Thank you, Quinn. Today we're just going to be focusing on this little section of the challenge here, submitting a forecast. So don't worry about the rest of it for now. We're working on getting you submitting forecasts as soon as possible. Okay, so I'm going to pause here for a second in case anyone has questions broadly about ecological forecasting or the challenge or the data. Feel free to shout out or put any questions in the chat. Okay, I will continue for now.
If you do have questions or you think of anything, feel free to put them in the chat when you think of them. So we're going to move on to the walkthrough, the hands-on part of the webinar. As I said in my introduction, we're going to be focusing today on the aquatics theme. Now, this is purely an example. The tools and the workflow that I'm going to show you today will be applicable to any of the other themes. So if you have a real burning desire to forecast beetle populations, the tools I'm going to show you today will help you do that. We're using the aquatics data because it's good data — I'm a limnologist by training, it's interesting — but I promise you the tools that we're going to show you today are going to be applicable across the different themes. So why would we want to forecast aquatic environments anyway? Water temperatures are really important in many biogeochemical cycles and for other water quality parameters, and they're important for thermally sensitive species in determining their habitat. So are we able to predict how water temperatures are going to change over the next month? That's the question that we're asking. Today we're going to be focusing on water temperatures in the lake systems that NEON collects data in, using this particular data product — if you're interested, you can go into the NEON portal and have a look at how the data are collected and things like that. We're going to be looking at the next 30 days as our forecast horizon, as it's called. So how is water temperature going to change from now until the end of November? The data that are available as part of the challenge have a latency of two to three days. So that means the data are available as part of the challenge just two to three days after they're collected, which is amazing. We have almost real-time data about what is happening in our lakes. We're going to be using a very simple baseline model just to illustrate the workflow that is possible for ecological forecasting. Now, I'm not saying this is the best forecasting model for water temperature — it's probably not — but that's the challenge for you: to come up with a better model than the one I'm going to show you today. Just to reiterate, NEON is an awesome dataset to use in ecological forecasting, and one of the reasons is that it has such a diversity of sites. Just within the lakes, we have seven different sites across a range of ecoclimatic domains, and as part of the challenge we're looking for water temperature forecasts across all aquatic sites. So how does our ability to forecast water temperature vary from our lakes to our rivers and streams? NEON has the diversity of sites to help us answer those questions. The data that we're going to use today were collected using these automated sensors, which are in the middle of the lakes most of the year — you'll see in some of the data there are gaps, when they bring out the sensors as the lakes start to freeze. But if you're interested in knowing more about how the data are collected and about the field sites that are included, you can go on the NEON website. There's lots of really in-depth information about that. Okay, so I want to get into some of the detail, and this is mostly because I'm probably going to use some of these words and I just want to make sure that we're all on the same page and we know what I'm talking about.
So when I refer to targets, this is the thing that we're trying to forecast, and also the thing that your forecast is going to be evaluated against. In this case, our target is water temperature. And specifically for the lakes — now, lakes have a depth, and water temperature changes across that depth — but for the challenge specifically, we're looking for the temperature at the surface of the lake. So that is our target for today. The next important thing to think about is uncertainty. Forecasts are inherently uncertain, and there needs to be some estimate of this in your forecast submission. You can represent uncertainty in a few different ways. You might represent it by doing multiple model runs — you can see this example on the right here, this middle plot. We have multiple different potential iterations of future conditions, which gives an estimate of the uncertainty in our future prediction. Another way to think about this is through a distribution. If you know what the distribution of your forecast will be, you can report the statistics of that — the mean and the standard deviation, for example, if you had a normally distributed forecast. Today, as well as using NEON data, we're also going to be using data from the National Oceanic and Atmospheric Administration (NOAA). They produce weather forecasts across the globe, and the NEON challenge organisers have been collating these data for the NEON sites to use as covariates in your model, so we have consistent weather forecast data across teams. There are three data products available as weather forecasts, and the challenge organisers have developed an R package, neon4cast, to help you access these more easily. The first two data products are forecasts — so forecasts of air temperature, relative humidity, wind speeds, things like that, into the future. NOAA produces what we call an ensemble forecast: multiple iterations of future conditions, with 31 different ensemble members. We have stage one, which is the raw forecast produced by NOAA. Then we have stage two, which is a processed form of stage one that has been interpolated to an hourly forecast. Then we have what we call stage three, which is what I like to term a historic data product. This is what we call a stacked dataset, where we've taken the one-day-ahead from every previous forecast and stacked them together. So you can see here we take the one-day-ahead from this first forecast in blue and stack it with all of the other one-day-ahead forecasts. One-day-ahead forecasts are very good, and this gives a good estimation of observed conditions. So where you want to train and run your model using the same type of data, you can do this using the stage three and the stage two data. The stage three is like a pseudo-observation, and it's really useful for any model training or calibration that you wanna do. Once you've submitted a forecast, it will go into our automated scoring pipeline, and this produces scores, which are a means to assess your forecast skill. We're evaluating the forecast against observations taken by, in this case, the sensors in the lakes. For the challenge, we use a scoring rule called the continuous ranked probability score (CRPS).
You don't need to know too much about how it works, but essentially, because we're asking people to include estimates of uncertainty, the scoring rule uses both the accuracy — how good your mean prediction is — as well as the precision, or the standard deviation, of your forecast. So if you're very close to being correct — your mean is very close to the observation — but you have a very wide uncertainty, you score less well than if your uncertainty was much narrower. These scores are made available to users on our dashboard, which hopefully Quinn will put in the chat. And so this is kind of what it looks like. You can look at how your forecast has performed, you can look at other people's forecasts, you can see how close you were to observations, how well you perform on average. And the scores are also available for you to look at locally if you want, if you wanna grab those and do some evaluation yourself. Finally, we're gonna talk a bit about standards. As I said in the introduction, another reason for developing the challenge was to help give a consistent terminology and standardization across the field. And specifically in these automated pipelines, we need a particular set of standard formatting to ensure that the automation works. So for example, you need to submit a forecast in a standardized file format, and it has to have a standardized name. Within that file, it has to have a specific structure with specific column names, which I will go through now. So if you were gonna produce a forecast, this is what it might look like. We have our target — these historic observations of whatever it is you're trying to forecast; in this case, the temperature of the water, which is our variable. And then we're gonna make predictions into the future. These predictions need to have some estimate of uncertainty, and in this case, we're representing our uncertainty using multiple different potential future trajectories. We call these ensemble members, and each ensemble member is identified using its parameter number. So this would be parameter number one, parameter number two, et cetera. We also have this idea of a reference date time. This dotted line here is the date that the forecast is being produced. So if this is a real-time forecast, that's gonna be today: we're making a prediction today of the conditions tomorrow, and so the reference date time would be today's date. All of our forecasts also have an associated date time — we're making a forecast of the first of November, for example. And if you've worked with NEON data before, you'll recognize these four-character codes, which all the sites have associated with them. So this is for Barco Lake, one of our Florida lakes. If you were gonna imagine what this might look like in our standard format, in a CSV or a table-type format, it would look something like this. The reference date time column is the date that the forecast is being produced — these are all being produced today — and the date time column is the date that we're making the prediction of, which goes forward into the future.
So you can see that at the furthest ahead, we're making a forecast 30 days into the future, which would be the 30th of November. Then there's the site that you're forecasting — in this case, Barco. The family and the parameter columns are two really important columns that describe the type of forecast that you're generating. In this case, the type of forecast is an ensemble, which has these multiple trajectories of the future, and each of those trajectories is given a parameter value — one to four, representing our four different ensemble members. Then there's the variable that you're making a prediction of, in this case temperature; the value of the prediction for that particular time step; and the model ID for your team or for you as an individual. In order to make comparisons among models, we need to be able to differentiate them, and the model ID column helps us do that. So just to put this all together, what might a basic forecasting workflow look like? First, I would really encourage you, if you're interested in doing this, to read some of the documentation. The challenge organizers have done a really great job of putting together FAQs and help sheets to really get you started and understand more about the data — where to get started and things like that. If you want to investigate the other themes and see how you might go about forecasting leaf phenology, you can have a look at the documentation for that theme. Then investigate the data, the target variables that you're interested in. So if you are interested in leaf phenology, have a look at the targets: what do the data look like? What kind of covariates might you want to include in the model or the forecast for your variable of interest? Then you want to build and apply your model. Maybe you already have a model that forecasts leaf phenology and you've been using it already, but you want to put it into a forecasting mode — that would be your next step. Then we want you to produce forecasts of future conditions and submit these to the challenge, which is what we're going to talk about today. The next step is to register. Part of the workflow requires you to register so that we know who's submitting forecasts, where you're from, why you're doing it, et cetera, plus some more information about the model that you're submitting, so that we can do cross-comparisons and evaluation within and across themes. Then wait for your scores to come in. For the water quality variables, as I said, we're looking at a two-to-three-day latency, so a forecast that you submit today will have been evaluated by the end of the week, which is really exciting. So if you're thinking about the iterative process of updating your model: in just two to three days, you can have more information about how you performed, update those model parameters, and make your model better every day — which is number seven: use the new data that come in to update the model and submit another forecast. And this is a really important step, this iterative nature of forecasting. And one step that we're not gonna be able to cover today, but I will point you to some resources, is this idea of workflow automation. You don't wanna be pressing go on your forecast every single day, and so using tools that enable you to automate the workflow that you've set up is really important.
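Before we pause for questions, here's a way to make those standards concrete in code — a minimal sketch, not the workshop's exact chunk. The column names follow the standards just described; BARC is a real NEON site code, but the dates and temperatures are invented, and the scoringRules package stands in here for the challenge's automated scoring pipeline:

```r
# A toy forecast in the challenge's standard format, scored with CRPS.
# The site code is real; the dates and values are made up for illustration.
library(dplyr)
library(scoringRules)

set.seed(42)
ensemble <- expand.grid(
  datetime  = as.Date("2023-11-01") + 0:2,   # the dates being predicted
  parameter = 1:4                             # four ensemble members
) |>
  mutate(
    reference_datetime = as.Date("2023-10-31"),  # when the forecast was issued
    site_id    = "BARC",
    family     = "ensemble",                     # uncertainty carried by members
    variable   = "temperature",
    prediction = 22 + rnorm(n(), sd = 0.5),      # fake surface temps, deg C
    model_id   = "example_ID"
  )

# CRPS rewards forecasts that are both accurate and appropriately confident.
# Score one day's members against a hypothetical observation of 22.3 C:
members <- ensemble |>
  filter(datetime == as.Date("2023-11-01")) |>
  pull(prediction)
crps_sample(y = 22.3, dat = members)   # lower is better
```

A forecast with the same mean but a much wider spread would get a worse (higher) CRPS against the same observation, which is exactly the accuracy-plus-precision behaviour described above.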
Okay, does anyone have any questions before we get into the next step? I also really like this cartoon. Quinn, do you have anything to add? Did I miss anything? No? Nope. Okay. So before we move on completely, I'm just gonna give you a little bit of a primer on what we're gonna be looking at. We're gonna be looking at an R Markdown document. If you received the email that was sent out last week, there was information on getting that document, but we'll go through it again in a second in case anyone missed it. I'm gonna be using a really simple linear model to make forecasts of water temperature in the lakes using air temperature forecasts. So this is some air temperature and water temperature data from one of our sites, and you can see it's roughly a linear fit. So can we use predictions of air temperature to make predictions of water temperature? We're gonna go through the whole workflow of obtaining the data, fitting the model and then generating the forecast. And then there are additional tasks to complete in that markdown, including modifying the model and submitting your own forecast. There are also some tasks that you can complete which take you through alternative forecasting approaches, which I'll talk briefly about — or if you're interested in forecasting different variables, we can talk about that as well. Okay, so if you didn't get the instructions, you can click them here or scan the QR code. I'll give you a minute or so to navigate to that page if you haven't done so already, and then I can talk you through some of the installation instructions. Okay, I'm gonna stop sharing momentarily. This is the link to the GitHub repo for you? Yes. Okay, can you see the GitHub? Yes. It's also in the meeting chat — there we go, go to that. Okay, so hopefully you will have navigated here if you haven't got the code already. If you have, well done you. That's great. Has anyone had any issues with any of the package installation? There's one key package that you need, the neon4cast package, which is gonna give you access to the NOAA weather data. That sometimes trips people up a little. But if you didn't get it installed, that's okay. I'm gonna talk you through it really quickly, and if you just wanna follow along and do this at a later time, that's fine — you can navigate back to this at another time. There are step-by-step instructions about how to set up the R environment, how to get the code that we're gonna run, and also some other options. If you're familiar with Docker — and if you're not, don't worry about that now. So the first thing you probably wanna do is open your R window, if you haven't done so already — your RStudio, I should say. And then you wanna copy these installation instructions. We're gonna be using mostly tidyverse functions, there are some functions from lubridate, which is for working with date-times, and then there's the neon4cast package, which installs directly from GitHub — you'll need the remotes package to install it. So if you just copy that, you can paste it in your window and click run. I don't need to do that — mine are already up to date — but that's the first step. If you are not running at least R version 4.2, you won't be able to run the neon4cast package, so I would recommend updating to at least version 4.2 if you haven't done so already. The next step, once you have the relevant packages installed, is to get the code that we're gonna be running. I'm gonna talk you through the recommended option, which is to fork.
If you're not familiar with GitHub, I'll give you a second option in a second, but if you are, I would recommend forking the repository. To do that, you're gonna go up to the fork button — you see this one has a lock because I've done this a few times — and you're gonna create a new fork. It'll give you some options: you can change the name of the repository, and it will give you the description that's already here. I already have one, so it won't let me copy it. And then you're gonna click the big green button to create one. This is gonna make a copy of what I have here under your personal account, and hopefully you'll have all of these files and folders available to you. The next step is to clone. So we've forked, and now we're gonna clone. And again, you wanna look for the big green button. Make sure it's on Local, and you're gonna copy this HTTPS link using this button here, which will tell you that it's been copied. Once you've copied it, go back over to your R window, where you've just installed all of the packages. You're gonna go to Project at the top, click New Project — it'll be the first one on your list. Did that work? There we go. You're gonna use the version control option, which clones a project from Git, and we're gonna paste in that HTTPS link that we just copied. This is gonna generate a directory — a folder, sorry — on your local computer. Then you're gonna click create project, and it's gonna bring all of those files from GitHub into your local R environment. Once you've done this and you've clicked create project, it should look something like this. You should have NEON-forecast-challenge-workshop, or whatever you decided to name your directory — your project, sorry — open in the top right-hand corner, and you should have all of the files from GitHub in here. This will probably not be open; it'll probably look like this. We're gonna be working through the submit-forecast tutorial today, so do you wanna open that? You probably won't have these other files either — don't worry about that — and you wanna open the one that's .Rmd, the R Markdown document. The others will appear as we work through the materials, so don't worry about that for now. As long as you've got these two, and probably this one, that's fine. So anyone that's using GitHub and Git, hopefully you have this. If you don't, I'm gonna go back to this. If you're not familiar with GitHub, you've not used it before, you don't wanna do it — that's absolutely fine. The other option is to click on the big green button again and instead download a zip file. This is gonna download onto your computer, wherever downloads go on your computer. You're gonna open that up and extract or unzip all those files, and once they're extracted — you can see I did this yesterday; this is the same file — you can go in here and open up this R project file, and that will then open up in your R window looking exactly like this. You should have NEON-forecast-challenge-workshop in the right corner — that's the project. And then again, you'll navigate to submit-forecast and this R Markdown document. Does anyone have any questions about that setup so far? The one thing I should note: if you've done the zip folder option, you won't be able to save any changes onto GitHub — what we call commits and pushes — which might cause issues later on, but it's not gonna cause any issues with you following the markdown document initially.
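For anyone who wants the setup commands in one place, here's a minimal sketch of what we just walked through. It assumes the challenge package lives at eco4cast/neon4cast on GitHub, as in the workshop materials:

```r
# One-time setup: core packages, plus the challenge's helper package,
# which installs from GitHub via remotes (requires R >= 4.2).
install.packages(c("tidyverse", "lubridate", "remotes"))
remotes::install_github("eco4cast/neon4cast")

# Then, at the top of each session:
library(tidyverse)
library(lubridate)
```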
And if anyone wants to do that later and has struggled now, feel free to email me — I'd be happy to walk through it with you again if you're interested. Okay, so this is the R Markdown. The code is all there, it should all run, fingers crossed, but I'm just gonna talk you through each step and show you a very simple forecast workflow that you can work through yourself, but also use, modify, and personalize to make your own — change the model, change the variables that you're interested in — and I'll show you some tools to help you do that. So this is just pre-setup; you don't need to worry too much about that. There's some background on how this workshop came about initially, and some more information about packages you might need. If you're gonna run this whole document, you'll need a few extra packages beyond the ones I initially told you about, but only if you're doing the later tasks, so don't worry about that now. The ones we're gonna need right now are the tidyverse and lubridate packages. If you're not familiar with R Markdown documents, the way that you run the code is this little green triangle on the right-hand side of each of these — they're called code chunks — and it will run every line in that chunk. So you're gonna click go, and it'll move everything into your console and run it. You can see here we loaded the tidyverse and lubridate packages. It's also gonna put any console output into this little window here, which you can then close — that's fine if it's getting in the way. So this information here is basically the introduction that I gave you in the presentation. If you wanna review any of the information I gave you, most of it will be in here. There's some more information about participation and signing up, and there are lots of web links in here for you to navigate to the documentation — so, more information about the challenge. We're gonna be working on water temperature today, but the challenge for this particular theme also covers oxygen and chlorophyll. And as I said, we're gonna be working on the lakes, but if you're interested in rivers and streams, there are 27 of those, so you've got plenty to work with. And you can forecast any combination of these things that you want. If you wanna forecast oxygen in the 27 rivers and streams, that's great. If you just wanna forecast chlorophyll in two of the rivers, that's also great. Maybe you wanna work on the ones that are in your particular state — that's great. Forecast whatever you like, forecast them all, that would be great. But if you have particular interests, then don't worry about forecasting everything if that's not what's interesting to you. For the aquatics challenge, we're looking specifically for daily forecasts. So we want daily predictions, at least 30 days into the future. You can make longer predictions if you like, but at least 30 days. And we take new submissions every day. So we'll submit a forecast today, but if you wanna update your model and submit another forecast tomorrow — maybe new data come in, you update the parameters slightly, submit a new forecast the day after — that's great. Today we're gonna focus on the lake sites, as I said, and we're gonna start with water temperature. Some of the submission requirements we talked about in the introduction — this idea of the standards and submitting things in a particular format — there's more information about that here, and I'll talk about it as we get to that stage of the submission process.
Okay, so here we are: the forecasting workflow. The first step — we need to know what we're forecasting and what those data look like. So the first step is reading in the data, looking at this historic data, or the targets, as I introduced in the presentation. These data are available with a latency of about two days, usually, and they have been cleaned, compiled, and standardized across time steps and things like that to generate our targets, which are then made available on the EFI server. The CSV link never changes, so you can consistently look for the aquatics targets at this location. The .gz just means a compressed CSV — it keeps the file size a bit more manageable on the EFI server, so don't worry about that. You can read in the CSV just using the regular read_csv function here. So we're gonna click the green button. It's gonna read it in and put it into our environment. We can see here that we have 173,000 observations across three different variables. The next thing we might want, as well as our targets, is to know what sites we're forecasting. I told you already that we're gonna be forecasting the lake sites, but you might not know where those sites are or what kind of systems they are. So we can also look for information about the sites. This has been compiled at — this happens to be a GitHub link, but again, it's just a CSV that you can read in. We're just interested in the aquatic sites, so we'll filter to where aquatics equals one, or true — these are ones and zeros for whether it's aquatics, terrestrial, phenology, beetles, ticks; it tells you which theme the site is present in. So again, we're gonna read that in. It'll go into our environment and we can have a look in a second. So what do the targets data look like? If you run this one here, it's gonna give you 10 of those lines. You can see that we have exactly the same column names as the ones for the forecast that you're gonna produce. It has a date time; it has this four-character site ID; it has the variable which the observation is for — in this case, oxygen and chlorophyll — and it has the value of that observation. You can see some of these are NAs, but we have observations all the way back to 2017. So that's what the data look like. And then the site information: if you're interested in learning more about the sites before you start forecasting them — maybe you wanna look at sites in a particular state or at a particular latitude — this CSV is a really good place to start. It has the site IDs associated with the full names, and which theme of the challenge they appear in — I filtered to the aquatics ones, which is why we just have aquatics ones here. It has what type of site it is: we're looking for lakes, but we also have wadeable streams and non-wadeable rivers, and you can read more on the NEON website about how these sites differ. There's information about access, what domain they are in, their latitude and longitude — all kinds of information that might be helpful when you're thinking about forecasts.
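If you're following along in the transcript rather than the Rmd, the read-in step looks roughly like this. It's a sketch: the URLs are the ones used in the workshop materials at the time, so if they've moved, check the challenge documentation:

```r
# Sketch of reading the aquatics targets and the NEON site metadata.
# Both are plain CSVs at stable links; read_csv handles the .gz compression.
library(tidyverse)

targets <- read_csv(
  "https://data.ecoforecast.org/neon4cast-targets/aquatics/aquatics-targets.csv.gz"
)

site_data <- read_csv(
  "https://raw.githubusercontent.com/eco4cast/neon4cast-targets/main/NEON_Field_Site_Metadata_20220412.csv"
) |>
  filter(aquatics == 1)   # keep only sites that appear in the aquatics theme
```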
The next step that you probably will want to do is visualize the data. It's a good idea to have a look at what the data look like, rather than just looking at a giant table of 173,000 rows — that's not very easy to understand. So this little code chunk here is just gonna change that into those scatter plots that we saw in the presentation, to get an idea of what the data look like historically. So we're making a plot of temperature, oxygen, and chlorophyll a, and we've generated three plots here. Oh, I missed a chunk — okay, sorry. What we're also gonna do is filter our sites by subtype, so we're just looking at the lakes, and we're gonna do the same thing for our targets. So that's why it took a little bit of time — we got this bit. We're gonna take just the target data from our lake sites, which is gonna make our target data frame a little bit more manageable for this particular workshop: we're looking at 25,000 rows. And again, we can plot it, and it's gonna generate those three plots, now just looking at our seven lake sites. So we have chlorophyll data, oxygen data, and temperature data. If I just minimize this, you can see here we have different durations of data. Some observations start in around 2017, some don't start until mid-2018, and at Prairie Lake observations don't start until mid-2019. It's also worth noting for the lake sites that at the majority — five out of the seven sites — the sensors get removed in the winter period because of ice cover, so we have these gaps in the winter where the sensors are removed. The only lakes where the sensors remain are Barco Lake and Suggs Lake, which are both in Florida and do not suffer from ice cover. So this is a good place to start looking at the data: thinking about how the sites differ, what kind of covariates you might wanna use to make predictions of these variables. So again, we have temperature. We also have observations of oxygen — if you're interested in making oxygen forecasts, have a dig into what the oxygen data look like. When is it available? When is it not available? What are the fluctuations like? And the same for chlorophyll a. And just to reiterate, these observations, these targets, are just from the surface of the lakes. But today we're gonna focus on temperature as a starting point. So we've done our visualizations. The next step is to think about what types of models might be useful for making predictions of these into the future. One way that you might think about this is to think about what is happening in the system right now. Usually a good predictor of what's gonna happen tomorrow is what's happening today, and a model such as persistence, which says that tomorrow is gonna be the exact same as today, is a useful place to start. If you're interested in these types of persistence models, there's some information about those — it's model two in this document, and there's a minimal sketch of the idea just below. Another way to look at it might be to think about how a variable has a relationship with another forecasted variable. So what does information about what might happen to air temperature tell us about what's gonna happen to water temperature? This type of model would use air temperature as a covariate, which is what we're gonna think about today. And finally, you might wanna think about what the historic data tell us about what is normal for this time of year. Likely, October-November last year is gonna be similar to what it's gonna be like this year. So using the historic data — what we call a climatology model — is another useful source of information for making forecasts. If you're interested in those types of models, see model three in this document, sometimes termed climatology or seasonal naive models.
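Here's that minimal persistence sketch — not the challenge's official baseline, just the idea in code. It assumes the targets columns are named datetime, site_id, variable and observation (as in recent versions of the targets file; check yours), and it lets the uncertainty come from the spread of recent day-to-day changes:

```r
# Persistence sketch: carry the last observation forward, with a random-walk
# spread estimated from recent daily changes. Column names are assumptions
# matching the targets file described above.
library(dplyr)

persistence_forecast <- function(targets, site, horizon = 30, n_members = 31) {
  obs <- targets |>
    filter(site_id == site, variable == "temperature", !is.na(observation)) |>
    arrange(datetime)

  last_obs <- tail(obs$observation, 1)
  sigma <- sd(diff(tail(obs$observation, 30)), na.rm = TRUE)  # typical daily step

  expand.grid(h = 1:horizon, parameter = 1:n_members) |>
    mutate(
      datetime   = max(obs$datetime) + h,     # days ahead of the last observation
      site_id    = site,
      family     = "ensemble",
      variable   = "temperature",
      prediction = last_obs + rnorm(n(), sd = sigma * sqrt(h))  # widening spread
    ) |>
    select(datetime, site_id, family, parameter, variable, prediction)
}
```

So tomorrow's forecast is today's value, and the uncertainty grows with the square root of the horizon, like a random walk.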
But today in this little workshop, we're just gonna focus on a linear model with covariates. And we're just interested in temperature, so I'm gonna filter out the oxygen and the chlorophyll data and just focus on those temperature targets. So our targets data frame here has been reduced down to 8,000 rows across our seven different sites — still a good amount of data for us to work with. So these covariates that we wanna use in our model — where might we get them from? Helpfully, the challenge organizers have already started to compile a lot of these data and make them available, to really reduce that as one of the barriers to producing forecasts. As I said in the introduction, there are these three different data products: stage one and stage two, which are forecasts of weather, and then stage three, which is this estimate of historical weather, a kind of pseudo-observation. We're gonna focus on just using the stage two and the stage three data today, and we're gonna use functions from that neon4cast package. So if you had any issues installing that package, you might not be able to do this section, but I'm gonna talk through it and you can follow along. What we're gonna do is generate a forecast of water temperature using air temperature as a covariate. So we're gonna access some NOAA air temperature forecasts and some NOAA air temperature historic observations, to fit and calibrate a model and then run the forecast into the future. The first thing we need is some historic weather data that we can match with our historic water temperature data, and to do this, we're gonna use the neon4cast package. The two colons (::) mean use this function from this package — so we're gonna use the function noaa_stage3 and assign the result to this object here. If you wanna go ahead and run this, depending on your internet connection — whoa, not fun. This ran when I ran it 20 minutes ago. Quinn, any ideas? We've never seen that error before. Try restarting. Sorry about this, guys — this 100% worked when I ran it an hour ago. Have you run all of the previous chunks? I think I have a workaround. Is there something on the back end? No, but there's another S3 bucket that folks were having issues with yesterday that I couldn't figure out either. But you got it working an hour ago, so I don't know. It worked perfectly. It looks like it's working for at least a couple of the participants. Oh, okay — it just worked for me. Okay. Do you wanna share your screen, Quinn? Sure, keep talking it through. Okay, so essentially what this is doing — hopefully you guys have got it working; I'm glad it's just me and not everyone else, that would be the worse way around — is creating a connection to this S3 bucket where the data are stored. We have 81 NEON sites and there are seven or eight different weather variables that you can get information about, so instead of bringing all of the weather forecast data for all of the sites locally in one big chunk and then having to filter it, what it's gonna do is create a connection to this remote bucket, or database, and allow you to filter remotely before bringing the data that you need into your local environment. So in this case, it's gonna connect to that stage three data, that historic data product.
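For anyone following along here rather than in RStudio, the chunk being described looks roughly like this — a sketch assuming the neon4cast package's noaa_stage3() interface, with the seven four-character NEON lake codes spelled out:

```r
# Sketch of pulling historic (stage 3) air temperature for the lake sites.
# noaa_stage3() opens a connection to the remote dataset; filtering happens
# remotely, and collect() brings only the filtered rows into memory.
library(tidyverse)

lake_sites <- c("BARC", "SUGG", "CRAM", "LIRO", "PRLA", "PRPO", "TOOK")

noaa_past <- neon4cast::noaa_stage3() |>
  filter(site_id %in% lake_sites,
         variable == "air_temperature") |>
  collect()   # the data arrive in your local environment here

# Collapse the 31-member hourly ensemble to one mean daily series, in
# Celsius, so it can be paired with the daily water temperature targets:
noaa_past_mean <- noaa_past |>
  mutate(datetime = as.Date(datetime)) |>
  group_by(datetime, site_id, variable) |>
  summarise(prediction = mean(prediction, na.rm = TRUE), .groups = "drop") |>
  pivot_wider(names_from = variable, values_from = prediction) |>
  mutate(air_temperature = air_temperature - 273.15)   # Kelvin to Celsius
```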
Then we're gonna generate this vector of the variables that we're interested in. If you want additional variables — maybe relative humidity or wind speed — you can find out what other variables are available here. Then we're gonna filter this connection based on this vector of variables, as well as the sites that we're interested in — in this case, just the lake sites. We're just interested in air temperature, and we only want historical observations since we started collecting data at NEON. So it's gonna filter, and then the final step, in order to bring it locally — so you've queried that database — is to use this function collect from dplyr, and that's gonna put it into your global environment. I won't be able to run that function because I don't have this particular object, but if you did, you should generate a data frame which has columns with the date time, the variable name, the site ID, and the prediction. Let me request screen control — there we go. Okay, so this is what the forecast is gonna look like. As I said, we have the date time, the site ID, some information about where the forecast is from — the same column names that I showed you in the introduction, the ones you need to submit a forecast, are also present in our NOAA forecast. So we have this idea of an ensemble: the weather forecast produced by NOAA is also an ensemble forecast with these multiple potential future scenarios. There's the height the prediction was taken from — the standard two meters above the ground for weather data — and then the prediction. You can see, as this is an air temperature, it's in Kelvin, which is the standard SI unit of temperature. You may or may not wanna work in Kelvin — you may wanna work in Celsius, which is what we're gonna do. So the next step is to take this ensemble prediction, because we just want a single-line time series that we can match with our water temperatures to fit the model. What we're gonna do here is some aggregation: instead of having this ensemble with the multiple different parameter values, we're just gonna have one mean prediction. We're gonna group by date, site and variable, and generate a mean prediction, which is what happens here. It's crazy that I can control your computer from my computer, but anyway, here we go. And so this is gonna produce this past mean, which will now be in your environment up here. Instead of having lots of different columns, we now just have three: a date time, a site ID, and then a column called air temperature, which was produced when we did this pivot here — we've taken it from a long format to a slightly wider one, where we just have one column for air temperature. The next part of the data that we're gonna need is the stage two. So we'll use stage three to fit our model, and then we're gonna use stage two to run it into the future. We do the same type of data manipulation and accessing of the S3 bucket, but we're gonna need a few other vectors. The first one is the date — if you're making real-time forecasts, the date that you're making the forecast is today, hopefully. One thing to note is that the NOAA data are only available the day after the forecast date: because the forecast takes so long to run, it only becomes available the day after.
So if you wanna make a forecast today, you can't use today's weather forecast — you have to use yesterday's weather forecast and run it into the future. So we're gonna take the day before today as our NOAA date, the date of the NOAA forecast. And again, we're gonna use this neon4cast package, this time the stage two function, and we're gonna use the argument start_date to collect the NOAA forecast from a particular date. We want the forecast generated yesterday, which we assign here. We again just want air temperature, and we're gonna use the same type of syntax to filter the connection that we've generated in this line, on 209 — we're gonna filter by the date, the site ID, and the variables of interest. And again, the final step is to collect. Hopefully this will have also come into your environment — you can see this is quite a large data frame because, to reiterate again, the NOAA forecast is this ensemble: for every date and every site, we have 31 different iterations of the future. For the challenge, we're interested in making daily predictions of water temperature, so we need daily predictions of air temperature to make our water temperature predictions. And so we're gonna do a bit more data wrangling just to get it in the format that we need. We're gonna take that large data frame, which has all these different columns — the predictions, the variable of interest, the reference date time (the date the forecast was generated, which you can see is yesterday, which is what I specified as the NOAA forecast that I wanted). What we're gonna do is convert the time stamps into a date, so that we can wrangle it a little easier, and then group by date time, site ID, parameter and variable. What this is gonna do is retain those individual ensemble members, so that we can maintain the uncertainty in our air temperature forecast and carry it through into our water temperature forecast. It's gonna take the mean prediction from the hourly down to the daily, it's gonna do that pivot again, and then with a little bit more mutating we're gonna convert it from Kelvin into Celsius. Did I run this already? I don't remember — I'm gonna click it again. And it's gonna generate this data frame with a future weather forecast at a daily time step, with 31 different ensemble members for every date. You can see that the forecast produced yesterday for today has 31 different ensemble members that vary in their prediction of air temperature, from around 21 up to 23 and a half. So you can see that there's uncertainty in what the air temperature is gonna be. Now we have our future weather and our historic weather, and we can start to build our model. It's a good idea to visualize the data again, just to check what they look like. You can see here the blue line is our historic observations and the black is our ensemble prediction of the future — this cone of uncertainty increasing as we go further into the future. The next step is to build our model. We wanna use our target water temperatures and our historic air temperatures and fit a linear model relating the two. What I'm doing here in this code chunk is just joining everything together — you don't strictly need this step, but getting all the data in one place just makes it easier to look at.
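Sketched out, the stage two pull and that join look roughly like this — reusing the lake_sites vector and noaa_past_mean from the earlier sketch, and the targets from the read-in step. The noaa_stage2() start_date argument and exact column types are assumptions based on the workshop materials, so treat this as a template rather than the canonical chunk:

```r
# Sketch: pull yesterday's stage 2 forecast, wrangle it to a daily time step
# (keeping the 31 members separate so their spread carries through), then
# join historic weather onto the water temperature targets for model fitting.
library(tidyverse)

forecast_date <- Sys.Date()
noaa_date     <- forecast_date - lubridate::days(1)  # stage 2 appears a day late

noaa_future <- neon4cast::noaa_stage2(start_date = as.character(noaa_date)) |>
  filter(site_id %in% lake_sites,
         variable == "air_temperature") |>
  collect() |>
  mutate(datetime = as.Date(datetime)) |>
  filter(datetime >= forecast_date) |>
  group_by(datetime, site_id, parameter, variable) |>  # members kept separate
  summarise(prediction = mean(prediction), .groups = "drop") |>
  pivot_wider(names_from = variable, values_from = prediction) |>
  mutate(air_temperature = air_temperature - 273.15)   # Kelvin to Celsius

# One row per site and date, pairing observed water and air temperature:
targets_lm <- targets |>
  filter(variable == "temperature") |>
  pivot_wider(names_from = variable, values_from = observation) |>
  left_join(noaa_past_mean, by = c("datetime", "site_id"))
```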
So you can see for every date we have a water temperature and an air temperature — and can we make a model that fits one against the other? Which is what we're gonna do. To do this, we're just gonna use the base R lm function, but there are lots of different methods to fit linear and non-linear models in R. One package that I've found really useful is the fable package, which is specifically for forecasting. There's a linear model function called TSLM — time series linear model — but there are lots of different forecasting approaches in there, and I would recommend looking at the documentation if you're interested in these regression-type functions. We're gonna start off just forecasting one site, to show you what the process is, and then I'll show you the next step. We're gonna start off with Suggs Lake, one of our Florida lakes — lots of data, so we should be able to fit a model nice and easily between our water temperatures and our air temperatures. So we're gonna forecast Suggs: we're gonna filter just to get targets data for Suggs Lake, we're gonna do the same thing for our future data, and then we're gonna fit the model. We're using the lm function with our dependent variable against our independent variable, and that's just gonna fit a linear model between those two variables. This generates a linear model with two different parameters — it's a straight line, so it has an intercept and a gradient. This is the intercept, at negative eight, and our gradient of around 1.5. We can use those two coefficients to estimate our water temperatures: we take the intercept, which is the first coefficient here, and we add it to the product of the second coefficient and the air temperature, which we take from the future NOAA data that we grabbed earlier. And from there, it's gonna produce a load of water temperatures that look like that — which is not super helpful to look at, so we'll put it into a nice table in a second. Now, if we wanted to do that for every site, you could copy and paste that code, or you could make it into a function if you wanted. But what I'm gonna show you is just a really simple for loop, where we take every single site — for each of our sites, we're gonna do the exact same thing that I just showed you: fit the model and generate the water temperatures. And then here we're just gonna put it all into a nice little table, or data frame, with the date time, the site ID, the parameter value (that ensemble number), the predicted water temperatures, which we generated using the linear model, and then give it a variable name. We'll bind it all together using this bind_rows function, and then it'll just tell us that it's done. So if we click go on that one, you can see it ran through each site, fitted the model and generated the data frame. So now we have all of these different potential future iterations of water temperature, based on our air temperature prediction. And here it is — you can also expand this window using this little button here if you want. You can see that this is similar to the figure that I showed you in the presentation, where we have past observations; we have this idea of a reference date time, which is the date that the forecast was generated; our prediction of water temperature based on those air temperature predictions; the date time for that prediction; the value of the forecast; and each of the individual sites shown differently.
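Spelled out, the fit-and-predict loop I've just described looks roughly like this — a sketch reusing targets_lm, noaa_future and lake_sites from the earlier sketches, with the intercept-plus-gradient arithmetic applied exactly as described:

```r
# One linear model per site, its coefficients applied to every ensemble
# member of the future air temperature forecast.
library(dplyr)

forecast_df <- NULL

for (site in lake_sites) {
  # Fit water temperature against air temperature on this site's history
  fit <- lm(temperature ~ air_temperature,
            data = filter(targets_lm, site_id == site))

  # intercept + gradient * future air temperature, member by member
  site_forecast <- noaa_future |>
    filter(site_id == site) |>
    mutate(prediction = coef(fit)[1] + coef(fit)[2] * air_temperature,
           variable = "temperature") |>
    select(datetime, site_id, parameter, variable, prediction)

  forecast_df <- bind_rows(forecast_df, site_forecast)
  message(site, " temperature forecast generated")
}
```

Because each ensemble member of air temperature gets its own water temperature prediction, the spread in the weather forecast carries straight through into the spread of the water temperature forecast.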
So you can see here that potentially in the next 10 days or so, it's gonna get a little warmer at Barco, and then it gets quite uncertain about what the water temperature is gonna be. You can see the sites are all slightly different. So there we have it — we have a forecast of water temperature. The next step is to make sure that it's in the standardized format, so that it can enter the automated pipeline; it ensures consistency across forecasts, essentially. So we have a date time; this idea of a reference date time, when the forecast was generated; the site ID; the type of forecast that you're submitting — for this one, we've used the ensemble members from NOAA to produce an ensemble forecast of water temperature; the ID for each of those ensemble members, given by the parameter value; the variable that you're predicting; the value of that prediction; and then you need to give it a model ID. The model ID for this example is example_ID. If you go ahead and make modifications to the model, or you decide to submit your own model, you'll need to give it a different ID so that we can differentiate the examples from actual submitted forecasts, and you'll also need to register. So we're gonna take the forecast that we generated and do some slight mutations, just to get it in the EFI format. And now we have a data frame that looks like the one I showed you in the presentation — the date time, the reference date time, site ID, family, and so on — all nice and tidy, ready to go. We need to submit it. And again, the challenge organizers have included functions in the neon4cast package to help you do this. There are a couple of functions that will help you in your submission step. The first one I like to point out is this forecast_output_validator. This is really helpful when you're first starting out, to check that you've got it in the right format. It'll take your file, and it'll check you've got the right file name and the right column names. And if you haven't, it'll tell you where you've gone wrong, and you can make changes to fix that. Then we're gonna use the submit function, and that's gonna take your local file and put it into the S3 bucket, where it'll get evaluated automatically. So the first thing we need to do is save the file — currently it's just in our RStudio window, it's not anywhere else. The file name has to be in a standardized format. The first thing you need is the theme — so we're submitting for the aquatics theme. The next thing is the date — this is the reference date time, the date of the forecast generation. We're gonna paste these together, with the model ID, into a CSV file name. Did that run? This one. So now we have a file name for an aquatics forecast that was generated today with a model ID of example_ID, and I'm saving it as a CSV. Nice and easy. I like to save my forecasts into a Forecasts folder — that'll generate a little directory here, just to keep them all organized away from your markdown documents — so we're gonna write it into that forecast directory. We can have a look in here — there it is. And then we use the validator function to check that everything is correct. As I said, it's gonna check that your file name is correct, that the variables you've said are in there are actually in there, that it has a reference date time — yes. And if it comes up true, that means you did a good job.
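Collected in one place, the save-validate-submit step looks roughly like this — a sketch assuming the neon4cast helpers forecast_output_validator() and submit(), and continuing from the forecast_df and forecast_date objects in the sketches above. The Forecasts/ folder is just a tidy habit, not a requirement:

```r
# Save, validate, submit. The file name convention is
# theme-referencedate-model_id.csv, matching the standards described above.
library(tidyverse)

my_model_id <- "example_ID"   # change this (and register) for your own model

forecast_efi <- forecast_df |>
  mutate(reference_datetime = forecast_date,
         family   = "ensemble",
         model_id = my_model_id)

forecast_file <- paste0("aquatics-", forecast_date, "-", my_model_id, ".csv")
dir.create("Forecasts", showWarnings = FALSE)
write_csv(forecast_efi, file.path("Forecasts", forecast_file))

# Returns TRUE if the file name, columns and contents all pass the checks:
neon4cast::forecast_output_validator(file.path("Forecasts", forecast_file))

# ask = TRUE gives a confirmation prompt before uploading; set ask = FALSE
# in automated workflows so the submission goes through unattended.
neon4cast::submit(forecast_file = file.path("Forecasts", forecast_file),
                  ask = TRUE)
```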
The forecast is valid and you can continue to the submission step. So the final step is the submit function, again within the neon4cast package, and there are a couple of arguments I just want to point out. You need the forecast file: where is that file? Make sure you specify the right location. Then there's the `ask` argument. If you set it to TRUE, it generates a little prompt: do you want to submit this forecast? Are you sure you really want to submit? That's really helpful when you're starting out, because you want to be sure whether it actually got submitted or not. But if you're setting up an automated workflow, you don't want to have to click yes every time, so you can set `ask` to FALSE and it will submit without prompting you. Quinn, what have you got set up here? I think Quinn has a different set of AWS credentials, with some conflicting things on his computer; but if you have no other S3 bucket credential issues, that should all run through. Here it says the format is right and you've done everything correctly; it's just the credentials that aren't correct. Hopefully you've managed to get through to that step (a minimal sketch of the submission call follows below).

So, thinking about next steps and how you can move forward with forecasting: is this linear model any good? Do we think that using air temperature is the best way to make predictions of water temperature? Maybe you have a process model that also uses solar radiation, or wind speed to induce mixing, or things like that. Maybe a machine learning approach that takes the last seven days and fits some kind of general additive model, or some other black-box model, would do better. Just to illustrate this, here is the fit between those two variables: maybe it's doing a good job up here, but there's still quite a lot of spread around that line. Maybe you can use the uncertainty from this model fit to improve your uncertainty estimation. So there are lots of ways you can think about modifying or improving the model we've given you to make better predictions of water temperature.

There are some tasks included here. Maybe you want to include additional NOAA variables in your linear model, or introduce a nonlinear relationship between the variables. You could try forecasting oxygen or chlorophyll in addition to, or instead of, water temperature; maybe you could use your water temperature forecast to make a prediction of dissolved oxygen. Or maybe you want to include a lag in the predictors: perhaps yesterday's air temperature is a better predictor of today's water temperature than today's air temperature is (there's a small sketch of this below). Just some ideas to get you started. And if you do make substantial changes to the model and you want to continue to submit forecasts, remember to change the model ID.

There's information here about registration, but you don't need to worry about that just yet. If you do have questions about it, please feel free to email me; I'd be happy to talk you through the registration process, or to help if you're interested in submitting a different model but aren't really sure where to get started. And if you are interested in learning more about forecasting approaches, as I said, the fable package is a really great resource; they've done a really good job of putting the information together.
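To make the submission step concrete, here's a minimal sketch using `submit()` from the neon4cast package, with `forecast_path` carried over from the saving step above.

```r
library(neon4cast)

# Uploads the saved file to the challenge's S3 bucket, where it is scored
# automatically. ask = TRUE prompts for confirmation before uploading;
# set ask = FALSE in an automated workflow to skip the prompt.
neon4cast::submit(forecast_file = forecast_path, ask = TRUE)
```

And as a taste of one of the suggested modifications, a hypothetical sketch of lagging the predictor so that yesterday's air temperature predicts today's water temperature; `site_data` is the assumed joined site data from earlier.

```r
library(dplyr)

site_data_lagged <- site_data |>
  arrange(datetime) |>
  mutate(air_temperature_lag1 = lag(air_temperature, 1))  # yesterday's value

fit_lagged <- lm(temperature ~ air_temperature_lag1, data = site_data_lagged)
summary(fit_lagged)  # compare this fit against the unlagged model
```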
It's got a really nice online textbook that I found really helpful when I was starting out with this. So if you want to learn more about persistence forecasting or climatology forecasting, there are again step-by-step instructions for generating those forecasts; there are a couple of additional packages you'll need, including the fable package, but that's all set out for you in the markdown (see the sketch below for a taste). I think that's everything. Does anyone have any questions, comments, queries? Did someone's thing not work? Does anyone need troubleshooting? That sounds really aggressive; that's not what I meant. But if anyone has questions, feel free to put them in the chat, raise your hand, or email me at a later date. Let me get back; can I put my presentation up? I can show you my email address. Yes, let's go with this one. So you can email me here. Have a look at the EcoForecast website to learn more about EFI, the Ecological Forecasting Initiative. The neonforecast.org site has all of the documentation about the challenge, if you want to learn more about the other themes; the tools I've shown you today for accessing the NOAA data and the targets files are all going to be the same regardless of which theme you do. A big shout-out to Quinn and Carl for all of the cyberinfrastructure they've developed to make this possible. And yes, please feel free to email me if you have questions or want to get involved.
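For a flavour of what fable makes easy, here's a hedged sketch of the two baseline-style models mentioned above. `targets_ts` is an assumed gap-free tsibble of daily water temperature for one site, and the MEAN model is only a crude stand-in for the challenge's day-of-year climatology baseline.

```r
library(fable)    # also loads fabletools
library(tsibble)

fits <- targets_ts |>
  model(
    persistence = RW(temperature),   # random walk: tomorrow looks like today
    climatology = MEAN(temperature)  # long-run mean as a simple baseline
  )

# 30-day-ahead forecasts, with distributions (and hence uncertainty) attached
fits |> forecast(h = "30 days")
```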