I believe you can answer your own data analysis questions. Do you? If you do, stick around for this week's edition of Code Club. I'm your host, Pat Schloss, and my goal each week is to help you grow in confidence in your own data analysis skills, so that you can ask and answer questions about the world around us using data.
A couple of weeks ago my wife conned our 15-year-old, Joe, into rototilling a big section of our pig pen so that she could have a garden in it this year. She's been very ambitious in planting this garden. It has grown from being very small to very large and even larger. A number of years ago I tried to create a garden that was monumental. I figured, I have a big tractor, I'm going to have a big garden. It was way too big once the weeds started popping up. Anyway, like every other family this year, we're thinking about gardening too. So my wife got Joe to rototill the garden, which I can kind of see out the window of my office here. It's a big area, and she started planting it little by little. It's been helped along this week because we had a bunch of rain the last few days where we are here in southern Michigan.
Another thing we've got going on this spring is that I've got about 20 apple trees in the front, along our driveway, that are starting to blossom. I'm going to start seeing the petals, but I'm kind of a worrier, and I worry about late-season frost. In the last Code Club I talked about how cold it was a couple of Saturdays ago, when we had historically cold weather. Well, is it going to be cold again? Will it kill my wife's tender plants, or kill my tender apple blossoms, so that we don't get any produce in the fall? We don't know, right?
Well, a normal person might go to a website to figure these things out. You could do a Google search for your town or city with "frost date", and they would tell you the date after which it's safe to transplant and you're very unlikely to get any more frost. The one I found, I think, was called Gardening in the Mitten. I have it linked in the web page version of this Code Club. That website will tell you the last date you would expect a frost in the spring and the earliest you would expect a frost in the fall, so you would know when your garden is basically done for the year, if you hadn't already given up and let it get overrun by weeds.
So a normal person, like I said, would go look at one of these websites. But friends, we are not normal people. We've got computers, we've got data, so why don't we use that data to answer the question ourselves, building on some of the skills we've been working with over the last few weeks in Code Club? We'll use the same data that I shared with you last week, from a NOAA weather station located in Ann Arbor, Michigan, on the University of Michigan campus, which has data going back to 1891. We'll learn about a new function called mutate, which we've actually seen in previous Code Clubs but I've never really talked about, and we'll use it with functions that are our good friends at this point (filter, group_by, and summarize) to answer this question about frost dates here in southeastern Michigan. Or perhaps you could get your own data from where you live to answer the same question for your region. In the exercises you'll take that idea and apply it to other questions.
I'm going to look at the frost date early in the year, in spring; you'll look at the one late in the year, in fall. There's also a rule of thumb I've heard recently: if the low temperature and the high temperature for the day add up to more than a hundred degrees Fahrenheit, then you'll have good grass growth. I've got a lot of sheep and cows, and it's important to me that the grass starts growing because I'm sick of paying to feed them hay. So when can I expect the grass to start growing each year? These are the types of questions we'll be able to answer after today's Code Club.
So again, at the Code Club website, on the page titled "So Cold in Dexter", we're going to go down to the prompt, and you can see this first code chunk in the gray box. I'm going to copy this over into RStudio, into a new R script, and save it as so_cold_in_dexter.R. If I run the whole thing, everything loads, eventually including the weather data, and we get to see the data frame that we worked with a lot last week. You can also learn more about this data set by going to my general R materials, which are linked at the top right corner of the Riffomonas website. There we spend about five sessions analyzing these weather data using different tools and tricks from dplyr.
So the question again is: what is the latest that we can expect a frost in the spring? A frost is any temperature below zero degrees Celsius. To break this down a bit, I'm going to write up what we call pseudocode. I'm going to put these steps in as comments, because later I might want to come back and intermix code or other ideas with them.
First, I need to determine whether the low temperature for the day was below freezing. In degrees Celsius, below zero is freezing. We have every day from 1891 to the present, so we want to look at every day and say whether or not that day's low was below freezing. Second, we want to aggregate the data by month and day, so that for every day of the year we can ask in what fraction of years there was a freezing temperature on that day. So the third step is to determine the fraction of years in which a month-day pairing had a frost. And finally, we want to find the day in May after which there's below a 5 or 10% risk of another frost.
Okay, so this is our pseudocode. Any time I'm trying to take on something a little more complicated than a one-line chunk of code, I like to put up an outline, because it really helps me organize my thoughts, and I always encourage others to do the same. It's kind of like writing a paper: we like to start out with an outline so we can then go back and flesh it out. It makes the problem easier, because looking at a blank page on your computer is pretty overwhelming, right?
All right, so we're going to start with this, and we know how to do nearly everything here. We're going to add one new thing, which is the mutate function, and we'll see something special about using logicals as we go through this. So the first function we're going to talk about is mutate. Now, mutate allows us to either change an existing column or add a new column. So if we want to
make a new column that says "is it below freezing?", we can do that with the mutate function. I'm going to take the weather data frame and pipe it into mutate to create a column, or variable, called below_freezing, and that is going to be equal to whether tmin is lower than zero. So we'll say tmin < 0. As I was looking around at other websites, some of them distinguished a soft freeze from a hard freeze or a severe freeze, using different cutoffs to define those levels. But for now, if it's below zero, I'm going to consider that a freeze, a frost.
In this syntax, tmin < 0 is going to give me one of three values: a TRUE because it's below zero, a FALSE because it's zero or higher, or an NA. An NA would happen if, say, something happened with the thermometer at the weather station that day and for some reason it didn't record a tmin value. It would be NA because we don't know what it was. If we run these two lines of code, we now get an extra column at the right-hand side of the data frame called below_freezing, and under that, in the angle brackets, we see lgl, which is short for logical.
One of the nice things about working with logicals is that they also have a numeric value. I don't know how you remember this, and I'm sure there's some mnemonic, but FALSE has the value of zero and TRUE has the value of one. I think of false as a form of nothingness. If someone is lying to you and everything they say is false, they're like a nothing, an empty suit. So empty, zero. Whereas truth is wholeness: if we have truth in our life, then we have completion, and one is perhaps a symbol of completion for you. Okay, forget that philosophical rant. We could also do this in R: we could create a vector with a series of TRUE and FALSE values. So
down here in the console, I'm going to create my_vector as TRUE, FALSE, TRUE, FALSE, TRUE. So my_vector has a series of TRUEs and FALSEs. It's important to remember that these TRUEs and FALSEs don't have quotes around them. If they had quotes around them they'd be strings, and they wouldn't so easily be treated as logical values.
To demonstrate their numerical value, I can use a function called as.numeric. If I run as.numeric on my_vector, you can see that the TRUEs were turned into ones and the FALSEs were turned into zeros. That's pretty sweet, because we can now use my_vector, that vector of TRUEs and FALSEs, or below_freezing, in other functions that normally take numeric data. The ones I use most often with logical data are sum and mean. sum is going to add up all the values, and that tells you how many values are TRUE. So if I do sum of my_vector, we get three. Now, what would mean be? Well, mean would be the average of all those zeros and ones, and that tells you the fraction of the values in my_vector that are TRUE. That's pretty useful, pretty slick, and it's something we can use going forward. If you forgot about the mean function, you could also take sum of my_vector divided by length of my_vector to get the same thing, but mean is a lot easier. And if you do have an NA value, then I believe the mean function will perform a little bit better anyway.
Okay, so let's go back to our example and add a summarize to the end of it: summarize, with frac_below_freezing equal to the mean of below_freezing. And we get an NA value. As I mentioned, this is where we run into the problem of having NA values in our logical vectors. R doesn't know what to make of an NA in the mean function. If you look in the help documentation for sum and mean, you'll see that
there's an argument you can use called na.rm. We can add na.rm = TRUE to the arguments of mean, and what this means is: before you calculate the mean, remove the NA values, so they won't count in the numerator or the denominator. If we remove the NAs, then we see that on about 35% of the days between October of 1891 and now there was a freezing temperature. This is southeastern Michigan. It gets cold here in the winter, so that makes a lot of sense.
But we don't want to know the total fraction of days on which it was below freezing. We want to know, for today, May 21st, in what fraction of years there has been a frost on May 21st, because I want to know whether I need to keep worrying about a freeze. So I can add to this, like we've seen in previous Code Clubs, a group_by function, to group by month and day. It says it doesn't know month. Why doesn't it know month? Let's see: weather should know month, then group_by month and day. Ah, it doesn't know month because I still have my summarize function in there ahead of it. I need to put my group_by before the summarize. And now we're doing it.
If we make this window a little bit bigger, we see that the first column is the month, the second column is the day, and the third column is the fraction of years in which it was below freezing. On January 1st, we see that 93% of the time we have a low temperature below freezing. That makes a lot of sense. I'm interested in May, because that's when things start to warm up here in southeastern Michigan. So if I want to look at May, hopefully you're saying with me, "Ah, we need to use the filter function." So I'll also say filter, month equals five, five being May, and we can then see the rows for the month of May. Great. Well, May's got 31 days, I think. If I want to see all
of the rows for that month in the data frame, I can say print with n = 31, and it will output data for all the days of the month. We can see that early in the month, on May 1st, there's about an 8.7% risk of a frost. That is, 8.7% of previous years had a frost on May 1st. That's close to 10%, and if we think about any frost after May 1st, the risk gets pretty big too. So if you think about a cumulative risk, things get a little better as we go down the month, and once we get past the 21st there's never been a frost. There has never been a frost on the 22nd through the 31st.
If we tick backwards to a point where we're willing to accept, say, a 5% chance of a frost, then counting back, we're probably waiting until about the 13th to stop worrying. If we get to the 13th, there's maybe only a 5% risk that we'll have a frost during the rest of May. That's encouraging. If we're willing to be a little more risky and go to the 10th, then maybe we'd have about a 10% risk of a frost later in the year. And certainly, if I went to that Gardening in the Mitten website, they would tell us that the Ann Arbor spring frost should end by May 10th, which largely agrees with what we've got. So if we want to be a little more cautious, we might wait until the 13th, so that we'd have only about a 5% risk, or we could move things up a little to the 10th and have something like a 10% risk to worry about.
Okay, so we could certainly make a plot of the entire year to see what's going on. That's going to require a few extra steps, using things like lubridate and ggplot that
we're not quite ready for yet. In a future Code Club we'll cover how to do that, how we might turn these probabilities into a plot showing the risk of a frost on any given day. We could also think about cumulative risk. Here we're showing the daily risk of a frost, but if there's a risk of a frost on the first, then it's perhaps likely that the next day might also have a frost, so you have an added risk as you go through the month. That calls for another function that we'll talk about in a future Code Club, called cumsum, C-U-M-S-U-M.
All right, so hopefully you now feel a little more comfortable using the filter function, the group_by function, and the summarize function, along with some of these arithmetic functions like mean and sum, but also this new function called mutate, which allows us to create new columns or replace existing ones. As we go forward, we'll see how we can use these in more interesting settings to answer some provocative, or boring, or just silly questions that we could have pulled up from another website. But at least I own this result now, right? I always feel like that's empowering: I can pick a date based on the risk that I'm willing to accept, not the risk some website was willing to accept for me.
All right, so go ahead and pause the video now and engage with the exercises that I've given you. After we come back, I'll go through the answers with you and we'll see whether or not we get the same result.
Hopefully you found those exercises engaging and learned something more about analyzing data with R to answer a real, practical question, albeit a question we could have answered with a simple Google search. Anyway, the first question was: when do we expect the vegetables in our garden to stop growing? To answer this, I'm going to copy the code chunk that we had above, and instead of looking at May, because that's the start of the
growing season, I'm going to look towards the end of the growing season. Let's start with September, because there are some cold days in September, though I'm not sure they're necessarily frosts. If we run that code chunk and look, we see that in maybe one year out of the past hundred and thirty there's been a frost on September 1st, but really nothing to speak of until we get to about the 22nd or so.
Now, if I want to be confident that there won't be any more growth, then maybe I want to look for something more like a 90 percent chance of a frost having happened. So I'm going to look into October, and again, we can look at October or any month we want by changing that month value in filter. Looking at the fraction of days below freezing for each day of October, we see that around the 18th we're getting to the point where there's been about a 90 percent cumulative chance of a frost over the previous days. So if our garden makes it to the 18th, that's about as far as we think it'll go, because after that we're basically guaranteed to have already had a frost. If you can get from May 10th to October 18th, about five months, here in Michigan, you're doing really well for your garden. It seems like a really short growing season, and for me, growing animals on a pasture where I need the grass to grow, I probably need even warmer temperatures. We'll come back to that after the second question.
The second question we want to deal with is: how many days in Ann Arbor have a temperature greater than 90 degrees Fahrenheit? This is a little bit of a different question because we need to make a new column. We're not going to be dealing with the below_freezing column. We need to make a new temperature column that isn't in Celsius but in Fahrenheit. I could have turned
90 into Celsius, but that kind of defeats the purpose of this exercise. So I'll come up here, take the weather data frame, and pipe it into mutate to make a new column. I'll call this tmax_f, as opposed to what we had, which was tmax in Celsius. The conversion from Celsius to Fahrenheit is 9 divided by 5, times the degrees in Celsius, plus 32. I run that, and I see I have a new column over on the right side of the data frame, which is the high temperature in degrees Fahrenheit.
Now I'm going to do very much what I did before with group_by and summarize. This time I want to know, for each year, how many days had a temperature greater than 90 degrees. So I'm going to create another column, which I'll call is_hot, because that's hot for me. We'll say is_hot is tmax_f greater than 90. Now we have an indication in our columns of whether or not each day was hot, at least by my standards.
What I want now is, for each year, how many days it was hot. I'll pipe this into a group_by on year, and after a group_by I generally do a summarize. We'll create total_hot_days, and for this we'll use the sum function. Remember, a sum over a logical vector adds up all those zeros and ones to give you the total number of TRUE values, so we're summing up the number of times that is_hot is TRUE. We'll say sum of is_hot, and again we're going to want to include na.rm = TRUE to remove the NA values. Now what I have is, for each year, the number of days on which it was above 90 degrees.
I don't have complete years' worth of data for 1891 or 2020, so I want to remove those. I'll do filter with year greater than 1891 and year less than 2020. Now, this is the total number of hot days per year, and I want to know the average number of hot days per year. So I can pipe this into another summarize function, with hot_days as the mean of total_hot_days.
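As a rough sketch of the hot-day count just described: the pipeline below converts the daily high to Fahrenheit, flags days above 90, totals the flags per year, drops the incomplete first and last years, and averages what is left. The tiny weather tibble and its column names (year, and tmax in Celsius) are made-up stand-ins for the real Ann Arbor data, so treat the numbers as illustrative only.

```r
library(dplyr)

# Hypothetical toy data: a few daily highs (Celsius) spread over four years.
# 1891 and 2020 stand in for the incomplete first and last years.
weather <- tibble(
  year = c(1891, 1950, 1950, 1950, 1951, 1951, 1951, 2020),
  tmax = c(5, 35, 20, NA, 33, 34, 10, 36)
)

# Step 1: count the hot days (high above 90 F) in each year.
hot_days <- weather %>%
  mutate(tmax_f = 9 / 5 * tmax + 32,   # Celsius to Fahrenheit
         is_hot = tmax_f > 90) %>%     # TRUE, FALSE, or NA
  group_by(year) %>%
  summarize(total_hot_days = sum(is_hot, na.rm = TRUE))

# Step 2: drop the partial years and average the yearly totals.
avg_hot_days <- hot_days %>%
  filter(year > 1891 & year < 2020) %>%
  summarize(hot_days = mean(total_hot_days))

print(hot_days)
print(avg_hot_days)
```

Changing the 90 in is_hot to 95 or 100 asks the same question for a hotter threshold.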
Okay, so we see that on average in Ann Arbor there are about 9.4 days a year where the temperature is above 90 degrees. You could do more, and perhaps plot this as a histogram. We'll talk about those concepts in another Code Club. But what if we wanted to look at a higher temperature, say 95? How many days over 95 do we have here in southeastern Michigan? Generally about one or two. And let's say 98: maybe every other year, or one in every four years, we have a day that hot. Similarly for 100 degrees: it just doesn't get that hot, thank goodness. Man, that would be brutal. I don't know how people in the South do it. So this is helping us see, on average, how many days we would have above a given temperature. The principles, and the order and flow of the functions, are very similar to what we used for the question of when the earliest fall frost comes.
Okay, so let's try this again with a different question, based on a rule of thumb that I recently heard on social media: if you add the low temperature and the high temperature for the day, and the sum is above 100 degrees, then grass will grow. Again, I have sheep and cows that go out on pasture, and I need the grass to grow because hay is really expensive. And I just love the look of puffy white sheep out on pasture, or my son's black-and-white Belted Galloways out on pasture too. So when can we get them on pasture? When can we expect the grass to start growing?
This sounds a lot like the frost date question, but we're going to use a slightly different mutate that looks like the one we just did in the second exercise. I hope you see that these questions are all related and that they build on each other. So we'll do another mutate. I'm going to convert my minimum and maximum temperatures from Celsius to Fahrenheit, and we then want to create a column that says whether the sum of those
two is greater than 100. So I'll do tmax_f as nine divided by five times tmax plus 32, and tmin_f as nine divided by five times tmin plus 32. That should be a comma, not a period. Let's see what that looks like. It's upset with me. Oh, because I put an equal sign instead of a plus sign. See, I make a lot of typos too. We now see that we have those two extra columns.
I'm going to add a column that I'll call grass_growing, and we'll say tmax_f plus tmin_f greater than 100. Again, if those two added together are greater than 100, the rule of thumb says the grass is growing. We can see here that the first dates in our data frame, in October, suggest we have grass growing at just over 100 degrees.
So again we want to aggregate these data to see when the grass starts growing. Like we did with the frost dates, we're going to group by month and day, and then summarize, with frac_growing_days as the mean of grass_growing. Again we've got NA values in there, so I need to add na.rm = TRUE to get the fraction of years in which we've had, theoretically, a growing day. And we see that on maybe one or two days, or I guess about five days, in early January we've had temperatures warm enough for grass growth. Michigan weather is very unpredictable and very erratic. But in general, as they say, April showers bring May flowers. That's another one of these rules of thumb or heuristics. Is it true? I don't know. Maybe, but probably not so much in Michigan.
So let's then do filter with month equals five, and then print with n = 31. This code chunk is almost identical to the code chunk for looking at the frost dates. The difference is what we're doing in mutate: what kind of column we're creating and what logical question we're asking. Otherwise this framework is identical to what we
had done for the frost dates. I hope you can see that, and that this general structure can be adapted to answer many questions, as we've already seen in this Code Club.
If we look at this, we see that at the beginning of May there's about a 70% chance each day of grass growth, but we don't really have reliable grass growth according to this heuristic until probably about the 23rd of May or so. Similarly, we could look at the end of the year by changing the month to 10, perhaps, and see when we stop getting reliable grass growth. Well, it's probably at the end of September, when temperatures really cool down, so we could look at September instead. And we see that, yes, in about 90 percent of years we still have grass-growth temperatures on the 24th of September, and after that the probability starts falling off.
Of course, if you know anything about grasses or pastures, there are different types of grasses that do well at different temperatures. So this is a heuristic, a rule of thumb, that would be interesting to correlate with soil temperature, which is another variable in these data sets. Perhaps we could evaluate this heuristic to see whether tmin plus tmax being over 100 degrees Fahrenheit matches agronomy data or soil temperature data, and look at the different types of grasses that grow at different temperatures. But that's a lot of work. Hopefully you've seen in this Code Club what we can do with a few lines of code and a question.
Thanks again for joining me for this week's Code Club. Be sure to take time to engage with the exercises to strengthen your skills and to keep practicing. Even better would be for you to ask your own question, using this data set to answer something that's relevant to you, whether that means getting data for your local area or again asking
a question that makes you curious. Please engage with the material. I'd love to hear what you're trying to do, so leave a comment down in the comments. If you run into a hurdle, let me know what it was. I'm sure there will be other people having the same challenge, and I'd love to cover that in a future Code Club, and perhaps take on the question that you were trying to answer, to help others learn R better.
Keep practicing, and if you've enjoyed this, please be sure to tell your friends about these Code Club videos. Like the video to help others find it, please subscribe to the Riffomonas channel here, and then also click on the bell so that you're notified when the next Code Club video is released. I'm releasing these every Thursday at noon, so look for the next one on May 28th around lunchtime. Until then, keep practicing, stay safe, and take care.
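For reference, the frost-date and grass-growth questions from this episode share one structure: build a logical column with mutate, group by month and day, take the mean of the logical column with na.rm = TRUE, and filter to the month of interest. Here is a minimal sketch of that pattern. The weather tibble below is invented toy data, and the column names (year, month, day, tmin, tmax, with temperatures in Celsius) and the output names are assumptions rather than the actual names in the episode's data set.

```r
library(dplyr)

# Hypothetical toy data: four years of lows and highs (Celsius) for May 1 and
# May 2. One low is missing to show why na.rm = TRUE is needed.
weather <- tibble(
  year  = c(2018, 2019, 2020, 2021, 2018, 2019, 2020, 2021),
  month = 5,
  day   = c(1, 1, 1, 1, 2, 2, 2, 2),
  tmin  = c(-2, 3, NA, 1, 8, 12, 15, 2),
  tmax  = c(10, 15, 12, 14, 20, 30, 28, 16)
)

daily_risk <- weather %>%
  mutate(below_freezing = tmin < 0,                 # frost flag
         tmin_f = 9 / 5 * tmin + 32,                # Celsius to Fahrenheit
         tmax_f = 9 / 5 * tmax + 32,
         grass_growing = tmin_f + tmax_f > 100) %>% # rule-of-thumb flag
  group_by(month, day) %>%
  # The mean of a logical column is the fraction of TRUE values; NAs are
  # dropped first. The .groups argument assumes dplyr 1.0 or later.
  summarize(frac_below_freezing = mean(below_freezing, na.rm = TRUE),
            frac_growing = mean(grass_growing, na.rm = TRUE),
            .groups = "drop") %>%
  filter(month == 5)

print(daily_risk, n = 31)
```

Only the mutate step changes from question to question; the group_by, summarize, and filter steps stay the same.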