I believe you can answer your own data analysis questions. Do you? If you do, stick around for another edition of Code Club. I'm your host, Pat Schloss, and my goal is to help you grow in confidence to ask and answer questions about the world around us using data. Let's think data. This week's Code Club will be grappling with one of spring's most existential questions here in southeastern Michigan, where I live: when is it finally going to warm up? On Friday evening, I helped my wife bundle up about six two-week-old chicks because the overnight low temperature was supposed to go below freezing. When I woke up in the morning and looked at the thermometer, it was 26 degrees Fahrenheit, or about minus three Celsius. Thankfully, all was good. The chicks all survived, but that got me thinking, how extreme was that low temperature? Today we're going to answer that question using data that I grabbed from the NOAA weather station located just outside of Ann Arbor. They have temperature data going back to 1891. We'll do this using many of the tools we've seen in previous Code Clubs: filter, summarize, and group_by. Plus, we'll learn a couple of other functions, including arrange, desc, and top_n. More importantly, we'll see how to combine all this tidyverse goodness to answer a question of interest to us, or at least to me. Let's go ahead and head over to the Code Club page for today, which is hosted on the Riffomonas website at riffomonas.org. As I mentioned, we're going to be using data collected from a NOAA weather station located just outside of Ann Arbor. Maybe someday I should go track down where it actually is. The nice thing about this data is that it's all accessible online and that you could update your data regularly to get the bleeding-edge data, so you could get today's data or yesterday's data. You could also get data for your own location. You might not care about Ann Arbor, Michigan, or southeastern Michigan.
Perhaps you're here in quarantine like the rest of us and you're thinking of a nice beach, maybe down in Miami or out in the Keys. Or perhaps you're curious, what was the weather like on the day I was born? Well, you can go to the NOAA website and you too can get data for that location. So before we dig into the tidyverse and answering our questions, I want to show you briefly how I got the data that I'm providing for you through a GitHub site, okay? If you click on this link for Climate Data Online Search, you can then, in this first dialogue window, select a data set. I'm gonna pick Daily Summaries and then select a date range, which opens up a dialogue here. I'm gonna throw this back as far as I can. I don't know that we have data going back to 1763 in Ann Arbor; Ann Arbor, Michigan didn't exist back then anyway. So I'll go ahead and click apply. What I'm doing is trying to get the broadest possible range of dates. I'm gonna search then for cities and I'll go ahead and add in Ann Arbor. It's probably best to pick the biggest city around you. Ann Arbor is not huge by any means, but what you're trying to go for is a place that is likely to have really long-term temperature data. I live in Dexter, Michigan, which is a pretty small town that only has a couple thousand people. So it's unlikely that there's a weather station here going back more than 100 years or perhaps even more than 50 years. So I'm gonna pick Ann Arbor because it's a big area. I probably could have also picked Detroit, maybe Lansing. You pick an area that's interesting to you, and you might iterate as you do this to find a weather station near you that goes back as far as possible. So I'm gonna click search, and this brings up a map view that's got this nice gray circle around Ann Arbor. I live here in Dexter, so it's a pretty good window here. And then over here on the right, I'm gonna go ahead and click on Ann Arbor, Michigan, US.
And I see that the period goes from 1891 to 2020. I'm gonna hold off on clicking add to cart because within this circle there are actually many weather stations, and NOAA will limit the amount of data that you can download all at once. All I want is data for one weather station going back as far as I possibly can find. So I'm gonna scroll down here and you'll see included stations. If I click on station list, this pops up all 54 stations that are available for Ann Arbor. And if I click on start to sort by start date, I see that there's one that starts on October 1st, 1891 and comes all the way through May of 2020. Okay, if I look through here, maybe I wanna look at Dexter. You can kind of toggle through to see the different cities near you. So here's Dexter, these first three for me. And they only go back to 2012, so that's not that far. So let's go back to the first page and sort. What we want is this Ann Arbor U of M, Michigan, US station. And again, you pick the region that's interesting to you. But the key I found was to pick the earliest possible starting date and the latest possible end date. And then you click add to add this to your data cart. What you'll see is that there's all sorts of data available here: air temperature, land temperature, precipitation, a variety of other things. Some of these things don't have data going all the way back to 1891. But if we look at things like air temperatures, we see that we have a min and max temperature for the day going back to 1891. Other things, like an observed temperature at a specified time each day, go back to 1926, which, don't get me wrong, is a long time, but we can get another 35 years if we go back to 1891. And for our purposes, that works pretty well. But we might also think about precipitation, like when is the last snowfall each year and things like that, right? People always talk about how there was snow at U of M graduation in whatever year. Is that actually true?
We could find out, right? Great, so if you scroll up, you'll see a tab in the upper right corner for Cart (Free Data). If you click on that, you'll see that you get a variety of cart options. What you want is the custom daily CSV file. We don't want a PDF. PDFs are no good for data analysis. And we wanna make sure our date range is as wide as possible, again going back to the beginning and coming all the way through the end. The data doesn't actually have Saturday morning yet, I'm noticing. So this is what we want. You can go ahead and click continue. Sometimes you might get an error at this stage, or perhaps this next stage is where the error comes in. I want the station name and I want air temperature and precipitation. And so I could click continue. This, I think, is where you generally get errors. You can then go ahead and enter your email address. If you submit this order, it will then email you to tell you when the data are ready. You might get an error around this stage saying that you've asked for too many years' worth of data. That generally happens if, like we said, there are something like 50 stations within the Ann Arbor area; it won't give you all that data. There are ways to get all the data, but that's way beyond the scope of this Code Club. So again, I'm giving you data to work with for Ann Arbor. I wanted to show you where I got mine so that you could go and find the data for where you're from. Okay. So we'll go ahead and transition now back to the Code Club webpage. At the prompt here, I have a handful of lines of R code, which I'm gonna highlight and copy. And then we'll go over to RStudio. Once I'm in RStudio, I'm gonna go ahead and, in the upper left corner, click on the white rectangle with the green plus sign for a new R script. And I will then paste in my lines of code. I'm not gonna use the right side of my RStudio window.
So I'm gonna get rid of that so that I can make my font a little bit bigger and you can more easily see what I'm doing. I can then highlight all this and click run. It will then run my code down in R. And what we'll see is that we get a handful of errors here. Don't worry about the errors. They're not errors, they're warnings about how it read in some of the other data. For now, trust me that reading the data in worked well. And we can of course look at aa_weather and see what our data frame looks like. We have a date, we have a Tmax, so the maximum temperature for the day, then Tmin, the low temperature for the day, and the observed temperature. Remember that the observed temperature only kicked in at around 1926 or so. And then we also have the year, month, and day. So the first thing I wanna do is see what the temperature looks like on May 9th. I'll go ahead and do May 9th equals, and then I'm gonna take aa_weather and pipe that to the filter command. Hopefully you remember that the filter function retrieves rows for us that satisfy a certain set of logical criteria. So we'll say month equals equals five, since we see that our month column in the aa_weather data frame is a number, and day equals equals nine. If we look at this data frame, what we see is that we have 127 rows, for about 127 years' worth of data for May 9th, right? So we could scan through this and ask, did it ever get to minus three for a Tmin on May 9th? That would be very tedious and very error prone. And if we wanted to look at that across all days, it would be just really miserable and really painful. So a function that we learned in a previous Code Club was summarize. We could take May 9th and pipe that to the summarize function. And we could then ask, what was the average high? What was the average low? What was the average observed? So we could say, av high.
We'll use the mean function for Tmax; av low, mean Tmin; and then av obs for mean Tobs. And then I always like to get an N so I can count the number of observations, and that's done with the n function. This outputs a one-row table showing us that the average high for May 9th is 19.8 degrees Celsius, the average low was 7.85, the average observed is NA, and we had 127 rows. Something this reminds me of is that the mean function has an argument you can use, na.rm, that removes any NA values. Now, it gets really tedious writing out the pipeline all in a single line. So what I'm gonna do is copy this up into my R script and break it apart onto different lines so that it's a little bit easier to read and to manage. I can then add the argument na.rm equals TRUE, and I can run this again, and now we see that we do get an average observed temperature for May 9th of 16.9 degrees. So again, that's the average, right? We wanna know about the extreme values. So we could add to this something that'll tell us about, say, a 95% confidence interval on the day. What I will do is take this pipeline and add LCI high, and what we'll use is the quantile function: quantile of Tmax, and we can give it a prob, a probability. If we want the 95% confidence interval, we wanna go from about 2.5% to 97.5%. But we won't write those as percentages, we'll write them as probabilities. So for the lower bound we'll do 0.025, and then for the upper confidence interval for the high, the Tmax, 0.975. Don't forget your commas. And then we'll do the same thing for the lows, Tmin. I'm not so interested in the observed, so I'm gonna go ahead and get rid of that. I can then run this, and it's upset that it can't find Tmin, because I mistyped the column name. All right, so what we see is that the lower bound of the 95% confidence interval on the low temperature for May 9th was about negative half a degree.
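The filter and summarize steps described so far might look roughly like the sketch below. It assumes dplyr is installed and uses a small made-up data frame in place of the real aa_weather; the lowercase column names (tmax, tmin, tobs) and the summary names (av_high, lci_low, and so on) are assumptions for illustration, not necessarily the exact names used in the video. Note that quantile at 0.025 and 0.975 gives the range that holds the middle 95% of the observed values, which is what the video calls a 95% confidence interval.

```r
library(dplyr)

# Toy stand-in for the weather data; values and column names are made up
aa_weather <- tibble(
  year  = c(2016, 2017, 2018, 2019, 2020, 2020),
  month = c(5, 5, 5, 5, 5, 6),
  day   = c(9, 9, 9, 9, 9, 20),
  tmax  = c(20.0, 15.5, 25.0, 18.0, 9.0, 27.0),
  tmin  = c(8.0, 3.0, 12.0, 6.5, -3.3, 15.0),
  tobs  = c(NA, 10.0, 18.0, 12.0, 1.0, 20.0)
)

may_9th_summary <- aa_weather %>%
  # keep only the rows for May 9th
  filter(month == 5 & day == 9) %>%
  summarize(
    # na.rm = TRUE drops missing values before averaging
    av_high = mean(tmax, na.rm = TRUE),
    av_low  = mean(tmin, na.rm = TRUE),
    av_obs  = mean(tobs, na.rm = TRUE),
    # 2.5% and 97.5% quantiles of the daily lows
    lci_low = quantile(tmin, probs = 0.025, na.rm = TRUE),
    uci_low = quantile(tmin, probs = 0.975, na.rm = TRUE),
    # n() counts the rows that went into the summary
    n = n()
  )

print(may_9th_summary)
```

Without na.rm = TRUE, av_obs would come back as NA here because one of the toy tobs values is missing, which mirrors the NA seen in the video.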
So it's not bizarre for it to be freezing, but still, we were a couple degrees cooler than even that, right? So perhaps we wanna know, well, what was the minimum temperature for the day? We have a couple of ways that we can do this, actually. Let me do it up here so I have it all together. I'll take May 9th and I can do summarize, and I can then say historic low and use min. The min function returns the minimum of Tmin, and I see the historic low was negative 2.8. So we broke that record. We had a lower temperature than the historic low for May 9th. What was the high? Well, we could do May 9th, historic high, and I'll do max of Tmax. So it was 31 degrees Celsius, and that's probably about 90 degrees Fahrenheit. Wow, that's pretty warm for May. Michigan weather is very unpredictable. So this tells us the min or the max for May 9th over 120-some years. But what year was it that it was negative 2.8? What year was it that it was 31.1? Another way that we can get this is to take May 9th and sort it by a column. So I could do arrange by Tmin. This is gonna arrange my May 9th data frame by the low temperature for the day. What we see is that it's still a data frame with 127 rows, but now we have it sorted in increasing order by Tmin. So we see that the low temperature of minus 2.8 was set back in 1947, almost 73 years ago or so, okay? So again, we broke a record. If we wanted to know the max temperature we could do the same type of thing, but what we'll see is that it's arranged now by Tmax, and again it's in ascending order, it's going up. So the lowest Tmax wasn't in the same year as the record low, which was 1947; it was in 1923. In 1923, we had the lowest high temperature of 3.3 degrees Celsius, okay? But we don't want the lowest high temperature, we want the highest high temperature.
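As a rough sketch of the two approaches just described, the example below computes the extremes with summarize and then sorts with arrange to see which year they came from. The data frame is a made-up five-row stand-in for the May 9th data, and the names may_9th, historic_low, and historic_high are assumed spellings.

```r
library(dplyr)

# Made-up May 9th records; years and temperatures are illustrative only
may_9th <- tibble(
  year = 2001:2005,
  tmax = c(18.5, 3.5, 30.0, 22.0, 12.5),
  tmin = c(6.0, -1.0, 14.5, -2.5, 2.0)
)

# One-row summary: gives the extreme values but not the years they occurred
extremes <- may_9th %>%
  summarize(
    historic_low  = min(tmin),
    historic_high = max(tmax)
  )

# arrange() sorts in ascending order by default,
# so the coldest low ends up in the first row along with its year
coldest_first <- may_9th %>%
  arrange(tmin)

print(extremes)
print(coldest_first)
```

Sorting keeps every column, so the year of the record is visible without a second lookup.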
So we can add a function to the arrange arguments, desc, so we've got Tmax nested within desc, and desc nested within arrange. desc is short for descending. And so now if we look at this, we see that we've got a descending sort by Tmax. The highest temperature on May 9th was in 1963, at 31 degrees, okay? That's pretty warm for early May in Michigan, certainly much warmer than it was on Saturday morning. All right, so a third way that we can get back data telling us about the low or the high temperature over 130 years would be to use the top_n function. I could do May 9th, top_n, and this is gonna return the top however many rows we want from our data frame. So if I wanna do top_n with Tmax or Tmin, let's do Tmax first since we were just talking about that. Let's look at the three warmest days for May 9th over the years. What we see is that we have these three days. In 1930, 1936 and 1963, we had temperatures of 30.6 degrees or higher. If I were to change this to, say, two, what we'd get back would still be three rows, because we have a tie at 30.6. So it returns the top however many rows, plus ties. Again, if I keep that at three, I get those three rows. Let's look at the bottom, at the minimum. To get the bottom three, the smallest three, we can say n equals minus three. The negative sign tells R to look at the other end. And so we see the three years where May 9th had the lowest temperatures were 1923, 1947 and 1983. What you'll notice, though, is that these are not arranged by the temperature. It's giving us the three rows that had the lowest temperatures, in their original order. So what you could always do is throw an arrange onto this. You could arrange by Tmin, and now we have the three rows ordered by the lowest temperature. Excellent.
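The desc and top_n behavior described above, including the tie handling and the negative n, can be sketched like this. The data are made up, with a deliberate tie at 30.0 so the extra row shows up; the lowercase column names are assumptions.

```r
library(dplyr)

# Made-up May 9th records with a deliberate tie in tmax (30.0 appears twice)
may_9th <- tibble(
  year = 2001:2006,
  tmax = c(18.5, 30.0, 31.0, 22.0, 30.0, 12.5),
  tmin = c(6.0, 13.0, 14.5, -2.5, 9.0, -1.0)
)

# desc() flips arrange() to a descending sort: warmest high first
warmest_first <- may_9th %>%
  arrange(desc(tmax))

# Asking for the top 2 returns 3 rows because of the tie for second place
top_two <- may_9th %>%
  top_n(n = 2, wt = tmax)

# A negative n takes rows from the low end; top_n() does not sort its output,
# so arrange() is added to order the result by temperature
coldest_two <- may_9th %>%
  top_n(n = -2, wt = tmin) %>%
  arrange(tmin)

print(warmest_first)
print(top_two)
print(coldest_two)
```

In dplyr 1.0.0 and later, top_n is marked as superseded in favor of slice_max and slice_min, though it still works as shown in the video.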
So I'm gonna go back up to my code chunk here where I had May 9th, and I'm gonna come to the bottom of my script and slightly edit this to start from aa_weather, and then we pipe this to filter to get month equals five, day equals nine. We had previously defined a variable called May 9th, and this is again what we ran and what we found. And so that was pretty good. But May 9th is gone. Let's look forward to other dates, right? Well, what about our birth date? I was born on June 20th, 1976. So if I look at June and the 20th, it's complaining because it wants na.rm in my quantile. So I'll do na.rm equals TRUE. Just when you think you've got everything working, you realize you have another subtle bug. Okay, so for June 20th over 128 years, the average high was 26.7 degrees and the high end is about 33.9, okay? So again, this is one date, but there are 364 other possible dates. You've got a birth date, right? We can change this very subtly so that we have this tabular output for every day of the year. What I'm gonna do for now is comment out this line for filter, and instead I'm gonna use a function we've already seen called group_by. We'll group by month and day. That way we're gonna take our aa_weather data frame, group it by the month, and then within each month group it by the day. So for each month-day combination, we're gonna have something like 128 rows. And then for each of those month-day combinations, we're gonna calculate all these temperature values. If we run this, sure enough, what we find is that we get a data frame with 366 rows. Remember there are leap years in there where we get the extra day; 2020 actually is a leap year. And so we can see that for January 1st, the average low temperature is minus 7.37 degrees Celsius. So that's about 19 degrees Fahrenheit.
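Swapping filter for group_by, as described above, might be sketched as follows. The six-row data frame is invented and stands in for the full record, the column names are assumed, and daily_t_summary is one possible spelling for the saved summary.

```r
library(dplyr)

# Invented records for two calendar dates across three years
aa_weather <- tibble(
  year  = c(2018, 2019, 2020, 2018, 2019, 2020),
  month = c(1, 1, 1, 6, 6, 6),
  day   = c(1, 1, 1, 3, 3, 3),
  tmax  = c(-2.0, 1.0, NA, 24.0, 22.0, 26.0),
  tmin  = c(-9.0, -6.0, -7.5, 12.0, 11.0, 13.0)
)

daily_t_summary <- aa_weather %>%
  # one group per month-day combination instead of one filtered date
  group_by(month, day) %>%
  summarize(
    av_high = mean(tmax, na.rm = TRUE),
    av_low  = mean(tmin, na.rm = TRUE),
    n       = n()
  ) %>%
  ungroup()

# Any single date can then be pulled out of the summary table
june_3rd <- daily_t_summary %>%
  filter(month == 6 & day == 3)

print(daily_t_summary)
print(june_3rd)
```

With the real data this would give one row per calendar date, 366 in all because of February 29th.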
And then we could use this to look at any date of the year that we're interested in. I could go ahead and save this to a variable that I'll call daily T summary, for the daily temperature summary. And again, we see the output that we saw previously. So if I take daily T summary, I can then run filter. Let's do my anniversary, which is June 3rd, so month six, day three. Yep, forgot the double equal signs. Always doing that. So we see on June 3rd, the average high is about 23.6 degrees. The average low is about 11.8 degrees. Nice early summer day. Great, so hopefully this feels like a little bit of review of the topics we've covered in our previous Code Clubs, where we were looking at functions like filter, group_by and summarize, using the data that we got from FiveThirtyEight, the candy data and the grammar data. So what I would like you to do now is to pause the video and engage with the five assignment activities, which allow you to build upon what I've talked about to answer other questions. The fifth question has you going out to the NOAA website and getting the data for your favorite location. I would also encourage you, perhaps not during this video but later in the week, to come back to this data set with a question of your own, something that's personally relevant and interesting to you. So I'll let you hit pause, and I'll come back and show you how I answered the questions. So hopefully you found these exercises engaging and they allowed you to stretch your muscles a little bit with these new functions we've learned. We've also seen many of these functions already, and what educational scholars find is that when you take knowledge and apply it to a new setting, a new context, you learn the material that much better. All right, so I've copied and pasted the questions from the website into my R script in RStudio. I've commented them out.
And so within these questions, I'm gonna show you how I would answer each of them. So which year was the hottest on your birthday? Which was the hottest since you were born, okay? So I'm gonna take aa_weather, and for the hottest on my birthday, I'll go ahead and do filter with month equals equals six, day equals equals 20. And just to make sure, I see my output down here is the temperature data for June 20th, and I could then do top_n with Tmax. I have to give it an N. And I see the hottest date on my birthday was in 1953, and the high temperature was 36.1 degrees Celsius, which is about 97 degrees Fahrenheit, almost 100 degrees. Okay, what about since I was born? I was not alive in 1953. Well, I'm gonna copy this and I'm gonna add one thing. Remember, you can use commas or you can use ampersands. We'll go ahead and use the ampersands. And then I'm gonna do year greater than or equal to 1976, the year I was born. And so we see the hottest birthday I would have celebrated if I lived in Ann Arbor was in 1995. I think in 1995 I was actually living in Ithaca, New York, when I was in college, okay? So again, this is how I would answer that question. Okay, so what were the hottest and coldest temperatures recorded the year you were born? We're gonna take a fairly similar approach using filter, right? If it says something like the year you were born, I can take aa_weather and pipe that to filter with year equals equals 1976. And double check. Filter. Typing is hard. We see all these are dates for 1976. And so for the hottest and coldest temperatures, I can then do top_n, and the hottest will be Tmax with n equals one. So on two days in 1976, July 14th and July 15th, it was 35 degrees. If we want the coldest days, again, filter year 1976. I could have defined a variable like birth year and then piped that into top_n or these other functions. But eh. And I'll do top_n with Tmin.
Remember, I need n equals minus one to get the lowest value. And so it got down to minus 23.3 degrees Celsius. That's pretty cold. So what is that? That's times nine-fifths plus 32. It's almost negative 10 degrees Fahrenheit. That's cold. And that was on January 18th. All right. So calculate the average high temperature for each year between 1892 and 2019. These are the years that we have complete data for. What was the average temperature for the year you were born, and which were the coldest and hottest years that we have data for, based on the annual average high temperatures? All right. So this is gonna be a little bit more challenging, but we have the skills. We'll take aa_weather, and I'm gonna do a filter because I wanna remove 1891 and 2020, since we only have partial data for those years, right? 1891 is gonna look colder because we only have October, November, December. And 2020 is also gonna look colder because we only have these first few months of 2020. So we'll do filter with year greater than 1891 and year less than 2020. All right. So that gets us our time range. Now, what was the average temperature for each year? I'm gonna do a group_by because I want to group by the year. I want the average temperature across the days within a year. And I'll do summarize. So I'll do av high, and I'll do mean of Tmax. And I see a few NAs in here, so I need to remember to add na.rm equals TRUE. And I get the average daily high temperature for each year. I'm gonna save this as annual T. And then I can answer these questions. For the year I was born, I take annual T and filter year equals equals 1976. So the average high temperature in 1976 was 14.5 degrees. And which are the coldest and hottest years we have data for based on this average, right? I'll do this a different way than what we did with top_n. So I'll do annual T, arrange by Tmax. But I want that in desc to get the hottest year on average.
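Here is a minimal sketch of the annual summary and the Celsius-to-Fahrenheit arithmetic described above. The data frame is made up, the column names are assumed, annual_t is one possible spelling of the saved variable, and c_to_f is a hypothetical helper that does not appear in the video. The sort uses the summarized av_high column, since tmax no longer exists after summarize.

```r
library(dplyr)

# Hypothetical helper for the "times nine-fifths plus 32" conversion
c_to_f <- function(celsius) {
  celsius * 9 / 5 + 32
}

# Made-up daily highs, including the partial years 1891 and 2020
aa_weather <- tibble(
  year = c(1891, 1892, 1892, 1893, 1893, 2020),
  tmax = c(5.0, 10.0, 20.0, 12.0, NA, 2.0)
)

annual_t <- aa_weather %>%
  # drop the years with only partial data
  filter(year > 1891 & year < 2020) %>%
  group_by(year) %>%
  summarize(av_high = mean(tmax, na.rm = TRUE)) %>%
  ungroup()

# Hottest year first, sorting on the summarized column
hottest_first <- annual_t %>%
  arrange(desc(av_high))

print(hottest_first)
print(c_to_f(-23.3))
```

The same pattern with mean of tmin and a plain arrange would put the coldest year in the first row.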
No, I don't want to use Tmax there. I want av high, sorry. And what I see is that the hottest year was 2012. So just about eight years ago was the hottest year on average. I'm also gonna go ahead and add to this pipeline the average low as the mean of Tmin. And I'll go ahead and add that na.rm equals TRUE. I can then do annual T, arrange by av low. And we see that the coldest year had an average low temperature of 1.89 degrees Celsius. That sounds weird. I wonder how much data we have in here. So I'll go ahead and add n equals n(). If I look back at this, yeah, it says we have 366 days, but the average low was about two degrees. So that's about 36 degrees Fahrenheit. It must have been really cold in the winter with a pretty mild summer. Okay, so again, we could have done this with top_n, but I wanted to show you how we could also do it with the arrange function. What is the average high temperature and 95% confidence interval for each month of the year? What is it for your birth month? All right, so we'll go ahead and take aa_weather. And we're gonna pipe this into something where we get the average high temperature and the confidence interval for each month of the year. Okay, so if we wanna do something for each month or each year or each day, that's telling us we wanna use the group_by function. So we'll do group_by month, right? Perhaps you wanna get married in the month of June, or you wanna figure out what month in general is the best month to get married for optimum temperature. Or you're trying to plan a party and you wanna know what it's like in June versus July or January. We can then do summarize. We want the average high temperature, so we'll do av high, mean of Tmax. And then, like we had above, we can do LCI high, and we'll do quantile, Tmax, prob of 0.975.
Actually, I want 0.025 because this is the lower confidence interval, and then for the upper confidence interval for the high, Tmax, prob equals 0.975. Oops, changed my font size. And so I would go ahead and run that. Tmax not found; I need to fix the column name. And again, we've got our na.rm problem, so we need to add that there. And we see for each month of the year the average high, the lower confidence interval and the upper confidence interval for that month. So if we do monthly T equals that, I mean, I could look at the table, right? But I could also filter for month six to see that the average high temperature in the month of June is about 26 degrees Celsius. And again, we can do our other things, like which is the coldest month, which is the warmest month? And you could use whatever month you're interested in. So what is it for the month of May? Well, let's do five, and we'll see that on average it's about 21 degrees for the high temperature. So that's about 70 degrees Fahrenheit. I don't know if it'll get there this May. Maybe towards the end of the month we'll see a few days in the 70s, but I'm not holding my breath. Very good. So again, the fifth exercise has you go back to the NOAA website and pull down the same kind of data that we've been using, but for your favorite location. I'd encourage you to do that and go back and answer some of the questions I found interesting, but also to come up with questions that you think are interesting. Thanks again for joining me for this week's Code Club. Be sure that you spend time going through the exercises on your own to help reinforce your new skills. Even better would be for you to take the data for your own favorite location and work through the questions again. I'd love to see what you did. Please feel free to drop a line in the comments below to tell us what questions you were eager to answer.
Perhaps you have a question that you're not quite ready to answer because you don't feel like your skills are quite there yet. That's great. Tell me what it is, and perhaps in a future Code Club we can come back and try to develop those skills to answer your question. Be sure to tell your friends about Code Club and to like this video. Please subscribe to the Riffomonas channel on YouTube and click on the bell so that you know when the next Code Club video drops. These are coming out every Thursday afternoon, so please keep practicing and we'll see you next time for another Code Club.