We're going to go through everything you need to know about downloading NEON eddy covariance data, then work through a little bit of data munging, plot it up, and evaluate the data by looking at it in different ways: first just looking at the time series data, but then also binning it and looking at it from a diel carbon cycle standpoint. I'll be live coding today, so I'll be typing as we go in case you want to follow along. Let me share my screen. Okay. I also wanted to point out that you can grab this tutorial from the web and follow along as well: if you're interested, you can go to the NEON Science tutorial pages and find the eddy covariance diel cycle tutorial. And if you don't have all the dependencies ready to code along and would rather just download the code and follow along, you can do that too, by clicking the link at the bottom of the tutorial page.

So, the first thing we'll want to do is get started by preparing our environment: we'll download some packages and set up our environment for the analysis we're going to make. Then we'll set up some arguments for the neonUtilities package to download eddy covariance data from the NEON API and stack that data together so we can use it for the analysis. We'll start with that, and then I'll backtrack and give a little overview of the products, because we're downloading a fair amount of data and I think we should get that started now in case it takes a while to download.

The first thing we'll do is set up the packages required for this analysis. I like to set up a variable for that, packReq, and put all the packages we need into that vector. We need BiocManager, because we're going to use the rhdf5 R package, which is hosted on Bioconductor. Then we add the neonUtilities package for downloading and stacking the NEON data, and then a couple of additional packages, ggplot2, tidyverse, and lubridate, for some of the data munging and exploratory plotting we'll be doing. Now that we have our packages in a variable, you can press Ctrl+Enter to run that line in the console. Then we need to install and load these packages. The way I like to do that, and I use this across pretty much all of my workflows, is to lapply across the packReq vector with a function that wraps around each of the packages in it. Inside that function you can use require() to determine whether a package is already installed and loaded, which is really nice: it can save you a lot of time and prevent you from unnecessarily re-downloading a package. With packReq supplying each package as the x variable in our function, we call require(x, character.only = TRUE), and if that returns FALSE, the package is not installed, in which case we call install.packages() on the package the loop is currently on. Lastly, we load the library so it's available to us while we're coding. A sketch of that setup follows.
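Here's a minimal sketch of that setup, assuming the vector is named packReq as described (the name is illustrative, and the tutorial's actual code may differ slightly):

```r
# Packages required for this analysis
packReq <- c("BiocManager", "rhdf5", "neonUtilities",
             "ggplot2", "tidyverse", "lubridate")

# Install (if needed) and load each package
lapply(packReq, function(x) {
  if (!require(x, character.only = TRUE)) {
    # Note: rhdf5 is on Bioconductor; if install.packages() can't find it,
    # use BiocManager::install("rhdf5") instead
    install.packages(x)
    library(x, character.only = TRUE)
  }
})
```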
So now if you run that, it'll check for all the packages and install everything you need for the analysis. The next step in preparing our environment is to run options(stringsAsFactors = FALSE). This just ensures that when we read in data, R won't do any automatic conversion to factors, which can be important when we're reading in attributes associated with the data in the eddy covariance HDF5 file.

Next, we'll get all the variables ready for zipsByProduct(), which is part of the neonUtilities package. I like to create variables for each of these to make it easier to modify my workflows later. We'll start with a start date, which comes in the format of year and then month. For this tutorial we'll use April 2022 as the start date and September 2022 as the end date, so we're focusing on the growing season and can evaluate the main pulse of the carbon cycle during the year at two of our sites. And we're going to evaluate the differences between these two sites: the Steigerwaldt site (STEI is the NEON site code) and the Treehaven site (TREE). This is a nice pair of sites to focus on because they're located relatively close together, just over a mile apart from one another, so the differences between them are more related to the vegetation and the management, as opposed to, say, differences in the climate the two sites might be experiencing.

So now that we have the sites and the start and end dates, we'll define our directory for downloading the data. For this tutorial I'll just use the temp directory that R provides: if you define your file directory as tempdir(), the data will download there automatically, and after your session is completed and you close out of R, the cache is cleaned and all the input data files are removed. That's the nice benefit of using the temp directory.

Then, lastly, we'll start the download. We'll use zipsByProduct() from the neonUtilities package, and you'll see that we need to define the data product ID, the sites we want to download data from, the start and end dates, and which package we're looking to download. For this tutorial we'll just be downloading the basic package for the eddy covariance data, which saves us a bit of size, because the eddy covariance data is already a fairly large data set. The expanded files come as daily HDF5 files and have some additional functionality, such as footprint matrices and extended quality metrics that allow you to evaluate the data quality, but none of those are necessary for today's tutorial. Additionally, there are other arguments you can define for zipsByProduct, such as the release; if you're worried about size, you can check the size of the files before downloading; and you can add a NEON API token, which really helps with download speed.

So we'll start by defining the data product ID. For the NEON eddy covariance files, the high-level flux data products, the data product ID is DP4.00200.001. I'll show you where to find this information on the NEON data portal as well, after we begin the download. And, as mentioned, we will be downloading the basic package.
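As a sketch, the environment setup and the download call look something like this (variable names are illustrative; check.size = TRUE will prompt you to confirm the roughly 1.3 GB download before it starts):

```r
# Prevent automatic character-to-factor conversion when reading in data
options(stringsAsFactors = FALSE)

# Arguments for the download
startDate <- "2022-04"            # year-month format
endDate   <- "2022-09"
site      <- c("STEI", "TREE")    # Steigerwaldt and Treehaven NEON site codes
dirFile   <- tempdir()            # cleaned up automatically when the R session ends
dpID      <- "DP4.00200.001"      # bundled eddy covariance data product

# Download the monthly basic-package files from the NEON API
neonUtilities::zipsByProduct(
  dpID       = dpID,
  site       = site,
  startdate  = startDate,
  enddate    = endDate,
  package    = "basic",   # expanded adds footprint matrices, extra quality metrics
  savepath   = dirFile,
  check.size = TRUE       # confirm the download size before it starts
)
```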
And now we just map the variables we created earlier to the arguments of the function, as in the sketch above: startdate equals our start date, enddate equals our end date, and we change the save path to our file directory, dirFile. Then, lastly, we leave check.size at its default, in case you're worried about the size of the files, because around 1.3 GB of data will be downloaded to your computer for this tutorial. So just be warned, and make sure that you have the resources available for that. And now we'll start running this. Sorry, we have to run each piece first: so we'll set our start date, then our end date, our sites, and the file directory, and now we'll start the download. As you can see, it's about 1.23 GB, so it can take a little while to download.

So while we're waiting, I have a couple of slides to give an overview. I'm sorry for just jumping into it, but we wanted to get the download started so we can dive a little deeper into the data in a bit. I'm going to give an overview now of the NEON surface-atmosphere exchange data product, and we'll start with a little background about NEON. NEON is supported by the National Science Foundation and operated by Battelle. The program was built to monitor continental-scale ecology; the project is set to operate for a 30-year time frame, and all the data and samples are free and open. The observatory is meant to monitor the drivers of and responses to ecological change using a standardized framework for research and experiments to build on, and to ensure that our data are interoperable for integration with other national and international network-scale science projects. That's the operational standard we're looking to live up to, and it went into the NEON observational design: we have 81 field sites, 47 of which are terrestrial and 34 aquatic, separated into 20 ecoclimatic domains so we can monitor ecological change across different ecosystems. And as mentioned, we'll be operating for 30 years and producing over 180 data products.

Now, a little background about the components of NEON. We have the terrestrial instrument system, which comprises the NEON tower for monitoring micrometeorological data, and then soil sensor arrays to monitor the physical properties of the soil. Additionally, we have the terrestrial observation system, where field scientists go out and collect samples and data in the field, and the NEON airborne observation platform, which flies the sites on a nearly annual basis and collects hyperspectral, LiDAR, and RGB data. And we have aquatic observational and instrumented components as well.

If we zoom in and focus on the terrestrial instrument system, which we'll be utilizing for today's analysis, you can see here that we have the tower and the instrument hut. The instrument hut is where the location controller, which collects all the data from the different instruments, is located. We have the tower with all the atmospheric instruments, and then the soil array with five replicates extending into the main wind direction, the footprint of the tower. Additionally, at our core sites we have a double-fence intercomparison reference (DFIR) that houses a weighing precipitation gauge, to give us a high-standard precipitation value at the site.
So how do we measure surface-atmosphere exchange? At NEON towers, the surface-atmosphere exchange system primarily consists of the turbulent components and then the storage component; you need both the turbulent and the storage components to get to the net surface-atmosphere exchange. The turbulent system is comprised of the 3D sonic anemometer and the infrared gas analyzer, here at the tower top. We also have profiles of CO2 concentration and H2O concentration going down the towers, as well as air temperature measured by thermocouples inside radiation shields, which allow us to calculate the storage component. So the profiles are used for the storage calculation, and the eddy covariance system is for the turbulent components. And as you can see, having 47 terrestrial sites means there are a lot of different ecosystems we're monitoring, and that requires different tower heights to ensure we're capturing the data we need. For instance, Wind River is a very tall site, with the exceptionally tall Douglas firs growing there, whereas at grassland, tundra, or agricultural sites we don't need to be that far off the ground, just high enough to measure in the well-mixed layer using the eddy covariance system.

The flux data we're going to be evaluating today are all packaged in the NEON HDF5 file format. If you're not familiar with HDF5 files, the format allows you to create a directory-like structure inside a single file, and it also allows you to put attributes onto that directory structure, so you can keep all the metadata associated with the data in the same file. Another nice aspect of the HDF5 files is that we can put an object description and a readme inside them, which can help direct our data users to the information they're looking for within the files themselves. Here you can see the directory structure inside the HDF5 file, and at the top level, if we click on the site group, we can see the metadata associated with the site: for instance, the canopy height, the displacement height, the measurement level heights, the latitude and longitude, and the ecosystem type, so all the information you would really need to work with the data.

Lastly, there are four levels of data products inside these bundled eddy covariance files. Level 1 (dp01) is just our basic statistics; you can think of it as, for instance, the mean dry mole fraction of CO2, that is, how many parts per million of CO2 were measured in a given 30-minute period. As we go up to level 2 (dp02), we have time-interpolated data products; we use this mostly with our storage-related data, so all the profile measurements get passed through to this level. At level 3 (dp03) we spatially interpolate the profile measurements, so that at level 4 (dp04) we can calculate, for instance, the storage flux of CO2. In the level 4 products we just have the fluxes, where we're combining the measurements from level 1, so multiple sensors and multiple data products, to get these flux measurements; that's why they're designated as level 4 products. You can also see here that we separate data from quality metrics within this HDF5 framework.
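If you'd like to poke at that structure programmatically rather than in HDFView, here's a hedged sketch using rhdf5 (the file name is purely illustrative; point it at one of your own unzipped NEON .h5 files):

```r
library(rhdf5)

# Hypothetical path to one unzipped NEON HDF5 file (illustrative name only)
fileH5 <- "NEON.D05.STEI.DP4.00200.001.nsae.2022-04.basic.h5"

h5ls(fileH5, recursive = 2)       # directory-like structure: site group, dp01..dp04
h5readAttributes(fileH5, "STEI")  # site-level metadata: canopy height, lat/lon, etc.
```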
Separating the data from the quality flags and quality metrics like that makes it easier to do an analysis with the data while keeping the quality flags handy for when we need them, either for removing data or for determining whether the data meet the standard we're looking for. If we dive down into the flux products, you can see here that we have the NSAE (net surface-atmosphere exchange) component, which is the combination of the storage and the turbulent components. And we keep them all separated here so you can evaluate each of them individually: for instance, if the storage component has some data issues for some reason, you can still use the turbulent fluxes, and vice versa. And lastly, this is what we'll be doing today: working through downloading these data and evaluating them using the neonUtilities R package. Here I've added the links to an introductory tutorial and then the tutorial we'll be working through today.

So now we should go back to R and see that our data have been downloaded. Now that we have all our data downloaded, we can check that the files are there with list.files(), setting full.names = TRUE, and we can see that all the files we downloaded for each month are now available to us. The next step is to stack the data, so we'll create a variable called flux, and we'll use the stackEddy() function from the neonUtilities package. You can see that for stackEddy we need to provide the file path to where these files are, and then the level of the data that we want. In this case we're just interested in the fluxes, so we'll use "dp04". If you're working with the level 1 data, you can also define the variables and the averaging period, but that's not necessary in this case. So we'll give it the file path, paste0(dirFile, "/filesToStack00200"); neonUtilities automatically creates this subdirectory when you download data using zipsByProduct, so you'll need to make sure your file path includes it. Then we just define the level as "dp04" and run it.

This part of the code can take a second as well, because it has to unzip the data. If you were to download this data from the portal, you would have to unzip the archive, and inside you'd find some gzipped files that you'd have to unzip as well before you could even look at the data. So, while we're waiting on the stackEddy function to work, I can show you what a downloaded file looks like. This is one of the NEON HDF5 files. You can see here at the top level we have the site ID, and if you click on that, it shows all the metadata associated with the site level. If we dig in and look at dp04 and a data product, we can see that the units for each dataset are attached at the table level; for instance, for the turbulent CO2 flux data, the units are micromoles of carbon per square meter per second. And if we open the data, you can see the time stamps and then the fluxes. This is an expanded file, so we have the flux, which is the same as the corrected flux: we do some additional corrections using our validation system to give us the corrected fluxes, and then we have a raw flux that is uncorrected. That's only available in the expanded file; today we'll be working with the basic file, so we'll only be working with the corrected fluxes. I just wanted to give you a little feel for the HDF5 file format. And now we have all our data stacked into this flux variable.
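As a sketch, the file check and the stacking step look something like this (the filesToStack00200 subfolder is the one zipsByProduct creates for this product):

```r
# Confirm the monthly zip files downloaded
list.files(paste0(dirFile, "/filesToStack00200"), full.names = TRUE)

# Unzip and stack the level 4 (flux) data for both sites
flux <- neonUtilities::stackEddy(
  filepath = paste0(dirFile, "/filesToStack00200"),
  level    = "dp04"
)
```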
And if we look at the flux object, we can see that we have a list, with separate elements under it for each site, plus the variables, the object description, and the issue log associated with this data product. If we start with the variables, we can see that neonUtilities gives us a comprehensive list of the variables we have in our flux list, including the category of the data, the system (for instance, fluxCo2 for the CO2 flux), and then the variable itself, so whether it's the turbulent, the storage, or the net flux, along with timeBgn, timeEnd, and the units associated with the data. Additionally, you can see the qfqm entries here, the quality flags and quality metrics: in the basic file, we only download the final quality flag, which gives us an overall assessment of the data quality. If you want to do a more refined analysis of the quality of the data, you can download the expanded files, and then you can choose which flags you want to apply to the data.

Okay, so now we're going to do a little bit of data munging to be able to explore the data. We'll start by creating a variable inside each site's data frame for site, and we'll just set that to the site we're looking at; we'll do the same for Treehaven. Then we'll combine these into a single data frame, which will allow us to do some more advanced analysis and plotting: we'll rbind the two site data frames together. Now we have a combined data frame with our site column and all of our flux data, so the CO2 flux, the H2O flux, the momentum flux, and the temperature flux. Additionally, we have some footprint statistics, which are also a level 4 product in the eddy covariance bundle.

Next, we're just going to plot the time period. During this tutorial we'll be using the pipe from the dplyr package, which allows us to do our data munging steps in sequence. We'll start by defining the data frame that we're going to use, and then we pipe that into ggplot. In ggplot you have to define the aesthetics, so the aes() call always consists of the variables you want to plot. We'll put timeBgn on the x-axis, and we'll start by looking at the turbulent CO2 flux on the y-axis, as it's the primary component of the flux, usually 80 to 90% of the flux during the day. And then, lastly, we'll want to look at the quality of the data, so we add a color aesthetic to those data points based on the final quality flag associated with the turbulent CO2 flux. Once you define the aesthetics, you define what kind of geometric object you want to use, and we're just using points here; then, lastly, we define the color scale. And then you can also do some really nice things: part of the reason we added that site variable and combined our data set is that we can facet, or create a grid of plots, based off of the site. Here we can see both the Steigerwaldt and the Treehaven sites next to each other, with our final quality flags highlighted in different colors: a zero means that the data are good and have passed all the quality flag and quality metric tests, and a one means the data failed some quality metric test along the way.
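Sketched in code, under the assumption that stackEddy's dp04 column names follow the data.fluxCo2.turb.flux / qfqm.fluxCo2.turb.qfFinl convention (verify against your own output):

```r
library(dplyr)
library(ggplot2)

# Tag each site's data frame, then combine them
flux$STEI$site <- "STEI"
flux$TREE$site <- "TREE"
fluxAll <- rbind(flux$STEI, flux$TREE)

# Time series of the turbulent CO2 flux, colored by the final quality flag
fluxAll %>%
  ggplot(aes(x = timeBgn, y = data.fluxCo2.turb.flux,
             color = factor(qfqm.fluxCo2.turb.qfFinl))) +
  geom_point() +
  scale_color_viridis_d() +   # illustrative color scale choice
  facet_wrap(~site)           # one panel per site
```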
Inside the tutorial page we give some background about how the quality flags and quality metrics are derived, with links to additional information, such as the algorithm theoretical basis documents we use for calculating these final quality flags. So next we'll summarize the quality metrics associated with the turbulent flux, and you can see that at Steigerwaldt around 22% of the data are flagged, and at Treehaven around 27%. This takes a lot of different components into account: there are sensor diagnostic flags, there are plausibility tests associated with the data, and there are also theoretical constraints of eddy covariance taken into consideration. Those include the stationarity test, which determines whether the signal is stationary and not changing too drastically over the averaging period; a lot of times the stationarity test will fail in the transition from nighttime to daytime, for instance. And we also have an integral turbulence characteristics flag, to determine whether there is sufficient, well-developed turbulence to do the eddy covariance calculation.

So now that we've looked at that, let's dive a little deeper into the quality flags and quality metrics. We'll take our data frame, feed it in and just grab the qfqm columns, then pivot the data from a wide format to a long format and group it by variable; this will allow us to summarize how much data are flagged for each of our flux variables. Then we'll plot that using ggplot again, but this time with a bar plot; there's a sketch of this below. Here we get a good synopsis of how much data are flagged for each of the data products across this several-month window, and you can see that the nsae variables are always the most flagged. That's because the NSAE is a combination of the turbulent flux and the storage flux, so it compounds the flagged data as well. Often you'll notice that the storage system has more flags raised than the turbulent system, and that's just because there are so many additional moving parts and potential points of failure when you take into consideration the pumps for each level, the sensors, and the valves that allow the air to come into the hut and flow into the sensor. So that's the reason we see the higher flag percentage for storage fluxes. Additionally, we're pretty conservative with our flagging currently, and if you were wanting to work with the storage fluxes, you could download the expanded file and use just the flags that you feel are most important to take into account.

So now that we've analyzed how much flagged data we have, we can look at removing the flagged data before we do any additional analysis. If we run this, you'll see we're basically feeding in the flux data frame, selecting all the final quality flags associated with the CO2 flux data, and summing them; then we're doing the same with the data to see how many NAs we currently have. You can see in the output, focusing just on the turbulent data for now, that we have around 4,300 raised final quality flags, and currently 1,619 NAs.
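A hedged sketch of that flag summary, the bar plot, and the counts (tidyr's pivot_longer handles the wide-to-long step; column names are assumptions as before):

```r
library(tidyr)

# Percent of records flagged for each quality-flag column, by site
fluxAll %>%
  select(site, starts_with("qfqm")) %>%
  pivot_longer(cols = -site, names_to = "variable", values_to = "qfFinl") %>%
  group_by(site, variable) %>%
  summarise(pctFlagged = 100 * mean(qfFinl, na.rm = TRUE), .groups = "drop") %>%
  ggplot(aes(x = variable, y = pctFlagged, fill = site)) +
  geom_col(position = "dodge") +
  coord_flip()   # easier to read the long variable names

# Count raised final quality flags and existing NAs for the turbulent CO2 flux
sum(fluxAll$qfqm.fluxCo2.turb.qfFinl == 1, na.rm = TRUE)
sum(is.na(fluxAll$data.fluxCo2.turb.flux))
```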
Now we'll apply those flags. Using the which() function, we'll grab the index of all the data that have a final quality flag equal to one, and once we have that index, we can use it to set those values of the flux CO2 data to NA. We'll do this for the turbulent, the storage, and the NSAE components. We still have the same number of flags, but if we look at the number of NAs now, we see that we have a greater number of NAs than flags, so we've applied all the flags to the data.

So now that we've munged the data and cleaned it up a little bit, we'll plot the data in a different light: we want to look at the diel cycle of the carbon fluxes. We'll start by creating an hour variable based off of timeBgn; we do that just by using lubridate's hour() function and then creating a factor variable out of that. Once we have the hour variable created, we can feed the flux data frame into ggplot again. This time we'll use the hour variable on the x-axis, we'll plot the turbulent CO2 flux on the y-axis, and we'll fill by site, which will allow us to evaluate the Steigerwaldt site versus the Treehaven site. For this plot we'll be using box plots, and we're going to add a summary line that goes through the medians to give us an overview of the daily cycle.

So when we plot that, you can see how the daily cycle of the turbulent CO2 flux looks: the red boxes are for our Steigerwaldt site and the green boxes are for our Treehaven site, with the lines running through the medians. Additionally, the nice thing with the box plot is that it gives you the median as this middle line, but we also get the interquartile range outlined, from the 25th percentile to the 75th percentile, so it gives us an idea of the variability of the fluxes for each hour of the day at each site. And you can see that we have greater variability at our Steigerwaldt site, which is a younger forest, than we do at our Treehaven site. Additionally, you can see the whiskers: these represent the bounds of the threshold for outliers, and they're calculated as 1.5 times the interquartile range. And then you can see the points here, which are just the outliers, anything outside the bounds of the whiskers. So this can be another good way to validate your data as well.

But one thing does look funny, right? We see carbon being taken up at hour seven, and the peak is around 16:00 or 17:00, which we wouldn't expect. And that's because all of our data are in GMT, or UTC, so maybe we want to modify the data to look at local standard time, to get a better idea of when the forests are most productive. And we can do that pretty easily, since we have the metadata within the HDF5 files. We can grab one of our HDF5 files by using list.files() and just grabbing an .h5 file. Now that we have one of the HDF5 files in our fileMeta variable, we can use the rhdf5 function h5readAttributes() to read in that metadata. So you can see we have a variable here now that's just a list of the site metadata, the same information we looked at when we were browsing the HDF5 files in HDFView. And in this, you can see that we have the time difference between UTC and local time, which will allow us to easily calculate local standard time; a sketch of these steps is below.
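A sketch of those steps, with the column names as assumptions (and assuming stackEddy left the unzipped .h5 files in place under filesToStack00200):

```r
# Set flagged turbulent CO2 flux values to NA (repeat for stor and nsae)
setNa <- which(fluxAll$qfqm.fluxCo2.turb.qfFinl == 1)
fluxAll$data.fluxCo2.turb.flux[setNa] <- NA

# Hour-of-day factor for the diel cycle (still UTC at this point)
library(lubridate)
fluxAll$hour <- factor(hour(fluxAll$timeBgn))

# Diel box plot of the turbulent CO2 flux, with a line through the medians
fluxAll %>%
  ggplot(aes(x = hour, y = data.fluxCo2.turb.flux, fill = site)) +
  geom_boxplot() +
  stat_summary(aes(group = site, color = site), fun = median, geom = "line")

# Read the site-level metadata from one downloaded HDF5 file
fileMeta <- list.files(paste0(dirFile, "/filesToStack00200"),
                       pattern = ".h5$", recursive = TRUE, full.names = TRUE)[1]
siteMeta <- rhdf5::h5readAttributes(fileMeta, name = "STEI")
```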
So that's what we'll do next: take our flux data frame and create a new variable, time begin in local standard time. We'll calculate that just by taking timeBgn and adding the hours from the UTC-to-local time difference in our metadata. Then we'll create an hour-of-day variable in local standard time and redo our plot. And that looks a lot better: now we can see the diurnal cycle we would expect for fluxes at our sites, where the productivity starts to go up as the plants photosynthesize in the morning, around six local standard time, and then it starts to die down around five or six in the afternoon.

So now that we've looked at the turbulent flux, we may also want to look at the diel cycle of the storage flux. In this case we just feed in the data similar to what we did for the turbulent flux and change the variable from turbulent to storage. Here you can see roughly the opposite effect of what we had for the turbulent flux: the storage flux is usually greatest at night and during the transition periods, so the morning transition and the evening transition. The storage fluxes are particularly important during these transition times and if you're doing analysis on shorter time periods; over a longer period, like an entire year, we don't expect as big of a contribution to the overall net ecosystem exchange, but it can definitely be important on shorter time scales and can still contribute up to 10% or so on the annual scale, sometimes a little more. And now that we've looked at the storage, we can look at the combined flux, the NSAE, and you can see that this plot looks a lot more like the turbulent flux, as it's dominated by that component.

Lastly, it's really hard to get a meaningful cumulative flux without doing some additional processing. We would need to do some u-star filtering, gap filling, and partitioning of the data from net ecosystem exchange into gross primary productivity (GPP) and ecosystem respiration to get a really good feel for the overall carbon dynamics of the sites. But just to get a rough feel, we can do a simple mean NSAE calculation to see how much carbon, on average, these two sites are taking up. What we'll do here is feed in our flux data frame, select the columns that contain the CO2 flux data along with site, then group the data by site and summarize using the mean. When we do this, we get an output table with the overall mean carbon flux for the net ecosystem exchange, or net surface-atmosphere exchange, the storage component, and the turbulent component. You can see that overall, Steigerwaldt takes up a little bit more carbon than Treehaven, which is kind of to be expected, as it's a pretty young forest that's growing pretty rapidly, whereas the Treehaven forest is a little more mature. But you can also see that Treehaven actually takes up more carbon through the storage term, so whether it's understory growth or some other dynamic going on there, it actually brings Treehaven closer to Steigerwaldt in overall NSAE, as opposed to the bigger difference that we see with the turbulent flux. The sketch below recaps these last steps.

So that's it for the tutorial. Now we can take questions, and let me know if anybody's having any issues or trouble with running the tutorial as well.
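A final sketch of the time-zone shift and the mean-flux summary (TimeDiffUtcLt is the attribute name as I recall it from the NEON site metadata; verify the exact spelling against your own file, and note the two sites are close enough to share one offset):

```r
# Shift timestamps from UTC to local standard time using the site metadata
fluxAll$timeBgnLst <- fluxAll$timeBgn +
  lubridate::hours(as.integer(siteMeta$TimeDiffUtcLt))
fluxAll$hourLst <- factor(hour(fluxAll$timeBgnLst))

# Mean flux by site for each CO2 flux component (nsae, stor, turb)
fluxAll %>%
  select(site, contains("data.fluxCo2")) %>%
  group_by(site) %>%
  summarise(across(everything(), ~mean(.x, na.rm = TRUE)))
```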