I believe you can answer your own data analysis questions. Do you? You should. Stick around for another edition of Code Club. I'm your host, Pat Schloss, and my goal is to help you grow in confidence to ask and answer questions about the world around us using data. This week, we'll build upon our recent experiences analyzing lamb prices to learn how to calculate those rolling averages we've been hearing about in the news for COVID-19. To do this, we'll learn how to use the lag function from the dplyr package. In recent episodes of Code Club, we have been looking at how to plot lamb prices from my local United Producers livestock auction. The data are pretty noisy. One way of smoothing the data has been to use the geom_smooth function. By adjusting the value of the span argument, we have been able to make the smoothed line more or less squiggly. With the ongoing COVID-19 pandemic, there's been a lot in the news about seven-day rolling averages. A rolling average is also called a moving average. The NRB News has been plotting daily case counts along with the rolling average for the state of Michigan. I noticed that this representation of the data puts the rolling average at about the fourth day of the seven-day series. I've been able to find the raw data that were used to build this plot in an easy-to-access format. So naturally, this got me thinking about how I could calculate a rolling average for my noisy sheep price data. To calculate the rolling average, we'll build off of the code and concepts that we have been working on for the past few weeks, so don't worry if you missed those episodes. We'll start with the block of code that we used last week and see how we can add the lag function from the dplyr package to help us calculate the rolling average. Along the way, we'll use the mutate and pivot_longer functions again, but in a new context.
As always, don't watch the video straight through without first firing up RStudio and trying the code and exercises yourself. If you haven't already followed the setup instructions, go ahead and pause the video so that you can go over to the Riffomonas Project website for this Code Club and make sure you have everything you need. To get going, I'm gonna open up a new R script here in RStudio and copy over the code from the Code Club session materials, which is at the beginning of the material. Hopefully you recall this chunk of code. Again, don't type this in; copy and paste it over from today's Code Club session. It's the very first code chunk that you'll see. You'll notice that we load the tidyverse package as well as the googlesheets4 package. We've seen this for a few weeks now: we read in data from the Google Sheets page, reformat it, and rename the columns to make them easier to work with. Then last week, we had noticed that our data were presented in ranges from min to max, with the values separated by a hyphen. So we used pivot_longer to take all those columns and stack them onto each other, and then we used separate to split the min and max values using the hyphen as the delimiter between those two values. Finally, we calculated the midpoint price between the min and the max. Last week we generated some plots as well. As I said in the introduction, last week and the week before we used geom_smooth to create a smoothed line over the observed data. These data are pretty noisy, and the smoothed line gave a pretty nice representation of the average feel of the data. What we want to do today is take that idea, and instead of using geom_smooth, see if we can use a rolling average to smooth out the data.
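Putting those steps together, the setup chunk looks something like this sketch. The sheet URL and the rename step are placeholders here, not the real ones; copy the actual chunk from the session materials:

```r
library(tidyverse)
library(googlesheets4)

# Hypothetical URL; use the real one from the Code Club session page
tidy_auction_data <- read_sheet("https://docs.google.com/spreadsheets/d/...") %>%
  rename(date = 1) %>%                        # placeholder for the renaming step
  pivot_longer(cols = -date,                  # stack the class columns onto each other
               names_to = "classes",
               values_to = "price_range") %>%
  separate(price_range,
           into = c("min", "max"),            # "150-165" becomes min and max
           sep = "-",
           convert = TRUE) %>%
  mutate(midpoint = (min + max) / 2)          # midpoint of each reported range
```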
The function that we're going to use to do this is called lag. To calculate the rolling average, let's think through a few steps that we're going to have to go through. The first thing we want to do is find the previous n observations to go into our rolling average. I'm not quite sure what I want n to be, but it's actually not going to be days; it's going to be weeks, because the data we have come out weekly. So we want to look at this week's data, and I think I want to add in the three previous weeks, so we're going to do a four-week rolling average. The next thing we want to do is represent those prices in separate columns. We then want to calculate the average across the current and previous weeks' prices. And then what we'd like to do is plot the data with the rolling average on top of the weekly data. It seems like a lot of things, but we already know how to do most of this; we're just going to see things in a very different context. All right, for this first problem of finding the n previous weeks that will go into our rolling average, we're going to use a function called lag. To give you a little bit of motivation for how lag works, let's create a vector that we'll call x that has the numbers from one to 10. If I look at x in my console, I have the numbers one to 10. Let's get a little bit more room here. One thing you may have noticed when we ran the library function: I realize just now that I haven't actually run that library function yet. So let me go ahead and run that code chunk. It's asking me to enter my information, so I give it one to load my Google Sheet and authenticate with my Gmail account. It reads that in, and I then run this tidy_auction_data chunk. Oh, I misspelled it. And there we go, we've got the data frame that we had from last week.
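To make the lag idea concrete, here's the toy example with the vector x:

```r
library(dplyr)

x <- 1:10
lag(x)          # one-observation lag (the default): NA  1  2  3  4  5  6  7  8  9
lag(x, n = 2)   # two-observation lag:               NA NA  1  2  3  4  5  6  7  8
```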
All right, back to the idea of lagging our data. You perhaps noticed way up here, when we loaded library(tidyverse), the message that dplyr::lag() masks stats::lag(). There's a lag function built into the stats package, which comes as part of base R. The lag function in dplyr is a lot easier to use and has some nice features that make it more attractive, so we're going to be using the dplyr lag function. What we can do is lag(x), and what this gives us, by default, is a one-observation lag. What you'll see is that it takes your input vector, which is 10 elements long, and outputs a vector that also has 10 elements, but it bumps everything to the right by one and fills the vacated position with an NA. We can think of this as yesterday's data, the day before's, and so forth. The default is one; we could make this more explicit by saying n = 1. We could also do lag(x, n = 2) to get a two-observation lag, and we could use whatever n we wanted for the increment of the lag. So we know how to do step one now. Let's go ahead and see how we can represent those lags as separate columns in our data frame. Again, we have tidy_auction_data, and to keep things simple for this week I'm gonna work with the large lamb data, so I'll do filter(classes == "large"). Don't worry, one of the exercises for later is for you to figure out how to do all of this for all of the classes without using that filter. Okay, so we're gonna use mutate, and I'm going to say lag_0 = midpoint. It's perhaps a little bit silly to create a column that's a copy of another column, but conceptually it helps me keep track of my various lagged columns. Then I can do lag_1 = lag(midpoint, n = 1). What you'll see in the output here, and let's raise this up a bit, is that we get two new columns: lag_0, which again is a copy of midpoint, and lag_1, which is lag_0 bumped by a week.
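Those first two lag columns can be sketched like this, assuming the tidy_auction_data frame from the setup chunk with date, classes, and midpoint columns:

```r
large_lagged <- tidy_auction_data %>%
  filter(classes == "large") %>%
  mutate(lag_0 = midpoint,               # this week's midpoint (a copy, for bookkeeping)
         lag_1 = lag(midpoint, n = 1))   # last week's midpoint; NA in the first row
```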
Again, that fills in with an NA, a lot like what we saw when we were working with that silly vector x. One other thing to keep in mind is that we're assuming our rows are in the correct order. I'm pretty sure that's a safe assumption, but if we weren't so sure, we could use an argument called order_by. With order_by, we order midpoint by another column, and what we're going to do to be safe is order our lag by the date column. Nothing changes, but it gives us more peace of mind that we've ordered our lag correctly; we wouldn't want to be taking a lag between two dates that were more than a week apart. All right, I want to do a four-week lag, so I also need lag_2 and lag_3. I'm gonna use the same syntax that I have here for lag_1, but instead of n = 1, I'll use n = 2 and n = 3. We could also copy and paste, but as you can see, I need help with my typing skills. We run all this and we see that for our large class data we get lag_0, this week's observations, then last week's observations, the previous week's, and then three weeks ago's. So we have four weeks of observations. Great, we've got our lag columns. Now what we need to do is calculate the average across all those columns. I'm gonna create another mutate line. I probably could have added this to that one mutate function, but I wanna keep it separate, and perhaps we'll see why later. I'll call this rolling_avg, and I'll say mean(c(lag_1, lag_2, lag_3)), and I forgot lag_0, so let's add that. And when I run this, what I get is a column of all NAs for rolling_avg.
Now, what's going on here is a small problem: because I have used the mean function with c() to concatenate, it's actually concatenating all four of these columns together, calculating a single mean of all of that, and putting that one value in every row of rolling_avg. And because na.rm defaults to FALSE in the mean function, it returns an NA value. If I set na.rm = TRUE, what I see is the same value in every row of rolling_avg. That's not what we want. All right, back to the drawing board. What we could do instead is write out the arithmetic. I'll do (lag_0 + lag_1 + lag_2 + lag_3) / 4, and we see that our rolling_avg column then looks pretty nice. I'm gonna go ahead and clean this up a little bit with select(date, midpoint, rolling_avg). Now I've got my three columns, and I could go ahead and plot this. But first, I'd like to show you a slightly different way we could run this to get the same output. I'm going to copy this down, and instead of this mutate line, I'll do group_by(date); that way it's treating each row separately. Then, instead of writing out the arithmetic, I can do mean(c(lag_0, lag_1, lag_2, lag_3)), and instead of mutate, what I wanna use is summarize. It's complaining that classes is not found, and that's because I didn't copy down tidy_auction_data. All right, now it's saying midpoint is not found. Let's see. Right, so if we run through this point, through the summarize function, we get two columns out: our date and our rolling_avg. So I need to add my midpoint column. I'll create a new midpoint column, and when it chunks those by date, so there's one row per date, it outputs midpoint, and that then gives me columns for midpoint and rolling_avg.
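Side by side, the two approaches look roughly like this sketch, using the column-naming convention from above:

```r
# Approach 1: write out the arithmetic inside mutate
large_prices <- tidy_auction_data %>%
  filter(classes == "large") %>%
  mutate(lag_0 = midpoint,
         lag_1 = lag(midpoint, n = 1, order_by = date),
         lag_2 = lag(midpoint, n = 2, order_by = date),
         lag_3 = lag(midpoint, n = 3, order_by = date)) %>%
  mutate(rolling_avg = (lag_0 + lag_1 + lag_2 + lag_3) / 4) %>%
  select(date, midpoint, rolling_avg)

# Approach 2: group by date so each "group" is a single row, then use mean()
large_prices <- tidy_auction_data %>%
  filter(classes == "large") %>%
  mutate(lag_0 = midpoint,
         lag_1 = lag(midpoint, n = 1, order_by = date),
         lag_2 = lag(midpoint, n = 2, order_by = date),
         lag_3 = lag(midpoint, n = 3, order_by = date)) %>%
  group_by(date) %>%
  summarize(midpoint = midpoint,
            rolling_avg = mean(c(lag_0, lag_1, lag_2, lag_3)))
```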
This works well because each grouping only has one row in it, so if I say midpoint = midpoint, it knows which value to pick. So it works in this situation. Again, these two approaches give the same output, with a slight difference in how they get there. I'm not quite sure which I like better. I guess if I were to add another lag, here I would only need to add a lag_4, whereas there I would have to add + lag_4 and then change the 4 to a 5. So in some ways the second approach is a little bit nicer, and I actually don't need the final select function. But if I look at the two sections of code, they're the same length, so I'm not really gaining that much. You pick what you like. I'm gonna go ahead and save this as large_prices, and we'll use that going forward to plot our data. Now, to plot the data, let's do large_prices, and this is all stuff at this point that we've seen before, right? The only new thing we've done today is this lag function, but we're seeing things that we've already seen, like filter, mutate, group_by, and summarize, in different contexts, and it really shows the versatility and power of these different functions. All right, so ggplot, aes, with date on my x-axis and rolling_avg on my y-axis. That's good. Then I'll add geom_line, and this gives me my rolling average. One thing I notice is that there are breaks in the line. We didn't see this before with geom_smooth, because geom_smooth was kind of interpolating between our points; because we have some missing data, we get NAs in there for our rolling average. All right, so beyond making this look a little bit nicer, one thing I'd like to do is put my rolling average data on top of my midpoint data.
Now, again, if we look at large_prices, we see that those two values are in separate columns. Does anyone remember what we did last week to get one column holding two different variables? Well, we're gonna use pivot_longer. So we'll say pivot_longer, and which columns are we gonna use? Everything but the date. names_to is going to be "method", and values_to I'll call "price". Let me run these two lines just to make sure everything works. Great, we've got columns for date, method, and price. We can then pipe that into ggplot. Our x is still gonna be date, y is now going to be price, and I'm gonna map color to method. Then we'll go ahead and do geom_line. And there we go: we have our smoothed line in this green color and our midpoint line, the raw observed data for each week, in this kind of red color. Now, those colors do not do much for me; I'm not a big fan. So let's clean this up a little bit to make it perhaps a little more attractive. What we can do is add a function called scale_color_manual. There are a lot of arguments we can give it to modify how the legend looks; the only thing I wanna change right now is the colors, and we'll come back and talk about some of those other things later. We'll use values = and give it the two colors we want, in the order they're shown here in the legend. The midpoint I'm gonna make gray; I like gray as a color to signify that something is in the background. And my rolling average, to highlight it in the foreground, I'm gonna make dodgerblue. We run that, and we see the blue line really pops off of the gray. That blue line, again, represents the rolling average. Now, we've seen in previous Code Club episodes that we can clean this up a little bit more. Let's do theme_light.
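The pivot and the recolored plot together can be sketched as:

```r
large_prices %>%
  pivot_longer(cols = -date,               # stack midpoint and rolling_avg into one column
               names_to = "method",
               values_to = "price") %>%
  ggplot(aes(x = date, y = price, color = method)) +
  geom_line() +
  scale_color_manual(values = c("gray", "dodgerblue"))  # in legend order: midpoint, rolling average
```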
I like that theme with these data because it gives us those grid lines to show us where the years occur. We can also add labs. I'll say title = "Rolling average smooths our weekly large lamb prices", x = "Date", and y = "Price ($/100 pounds)". Oops, that's a percent sign, not a dollar sign; let me fix that. Maybe I'll go ahead and add a subtitle saying the rolling average is a lagging four-week average, and for my caption I'll put "Prices as reported by United Producers for Manchester, Michigan". Very nice, right? This looks really good. This is something that, again, does a nice job of smoothing the data a bit. We could perhaps play around with how long of a lag we want, but I think this does a nice job of showing the smoothing of the data. Now, to show you a difference, I'm gonna copy this down, and instead of using the rolling average, I'd like to compare this to geom_smooth. I'm gonna remove my pivot_longer line, and instead of price I'm going to put midpoint, and we're not going to have a color mapping at this point. For geom_line, I'm gonna make the color gray. Let's look at what this looks like for right now, okay? Then I can add a geom_smooth with span = 0.1, and we can do se = FALSE. I can get rid of scale_color_manual because we only have one variable we're plotting. And I notice I forgot a closing parenthesis for my large_prices, right? So that should work. Now we can compare this smoothed line to the rolling average line, okay? There are kind of aesthetic differences in the way the plots look, but I'm not sure which I prefer. I'd be curious what you prefer; maybe you could put something in the comments below. One of the things I like about geom_smooth is that it's a continuous line. But one of the challenges with geom_smooth is that I don't really understand what the span argument represents. What's 0.1, right?
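For comparison, the geom_smooth version of the plot might look like this sketch; the label text here paraphrases what's said above:

```r
large_prices %>%
  ggplot(aes(x = date, y = midpoint)) +
  geom_line(color = "gray") +              # raw weekly midpoints in the background
  geom_smooth(span = 0.1, se = FALSE) +    # loess-style smoother instead of a rolling average
  theme_light() +
  labs(title = "Smoothed weekly large lamb prices",
       x = "Date",
       y = "Price ($/100 pounds)",
       caption = "Prices as reported by United Producers for Manchester, Michigan")
```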
Whereas if I have a rolling average of four weeks, I know what it's a rolling average over. So perhaps it's a matter of taste. But again, this is the comparison between using a rolling average and using geom_smooth. So I'm gonna go ahead and take a break. I've got some exercises for you to work through to practice using the lag function, as well as some of these other functions that we've been working with over the past few weeks. For the three exercises I have for you, the first question is: how would you modify the code we generated for the large lamb prices to calculate either a three-week or a five-week rolling average? Remember, we did a four-week rolling average. For the second problem, I'd like you to determine how we would calculate the rolling averages for all classes of lambs together, and then build a plot that shows the prices for the small, medium, large, and extra-large classes of lambs using the rolling average. Finally, for the third question: it gets a bit cumbersome, right, to do this arithmetic and build the columns to calculate a rolling average? What if we wanted to do a rolling average over 50 days or 50 weeks? That would be really painful. Thankfully, there's a package called zoo; I forget what it stands for. Go ahead and install the zoo package. It's got a lot of good time series functions in it, and there's a function in the zoo package called rollmean. I want you to look at the documentation for rollmean to see whether you can figure out how to replace all of our mutate lines with one mutate line that calls rollmean to calculate the rolling average. All right, I'll leave you to pause the video now and work through these exercises on your own, and once you finish the exercises, I'll go ahead and show you how I've approached these problems.
So hopefully you found those exercises engaging, learned a little bit, and perhaps stretched your brain muscles a little. For this first question, how would you modify the code we generated to instead calculate a three-week or a five-week rolling average? To do this, I'm gonna go back up, pick either of these code blocks, and copy it down here to my solution. If I wanna do a three-week rolling average, I need to remove one of these lags: I'll remove lag_3, then remove lag_3 from the arithmetic as well, and divide by three instead of four. Again, you could have done this with the group_by/summarize approach as well. We run this, and sure enough, we get our three-week rolling average. We know it's a three-week rolling average because the first two values of our rolling_avg column are NA values. If we wanted the five-week rolling average, we would add lag_3 and lag_4, setting n to 3 and 4, then add + lag_3 + lag_4 to the arithmetic and divide by five. And again, we know it's a five-week rolling average because the first four values of rolling_avg are NA values. I'm not gonna do it now, but you could then feed these into our plotting pipeline to see how that affects the smoothing of the curve. All right, how could we calculate the rolling averages for each class of lambs? We wanna build a plot that shows the prices for the small, medium, large, and extra-large classes of lambs. To do this, I'm gonna come back up and take the code that we just copied. I'll take the first version, where we used mutate rather than group_by and summarize; again, it doesn't matter, since those two approaches gave us the same values. What I'd like to do is group by my different classes of animals.
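The three-week and five-week modifications might look like this, building on the mutate version from earlier:

```r
# Three-week rolling average: drop lag_3 and divide by 3
three_week <- tidy_auction_data %>%
  filter(classes == "large") %>%
  mutate(lag_0 = midpoint,
         lag_1 = lag(midpoint, n = 1, order_by = date),
         lag_2 = lag(midpoint, n = 2, order_by = date),
         rolling_avg = (lag_0 + lag_1 + lag_2) / 3)

# Five-week rolling average: add lag_3 and lag_4 and divide by 5
five_week <- tidy_auction_data %>%
  filter(classes == "large") %>%
  mutate(lag_0 = midpoint,
         lag_1 = lag(midpoint, n = 1, order_by = date),
         lag_2 = lag(midpoint, n = 2, order_by = date),
         lag_3 = lag(midpoint, n = 3, order_by = date),
         lag_4 = lag(midpoint, n = 4, order_by = date),
         rolling_avg = (lag_0 + lag_1 + lag_2 + lag_3 + lag_4) / 5)
```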
Instead of building separate plots, I wanna build one plot with all the lines. So instead of filter, I'm gonna do group_by(classes). If I run these two lines of code, and let me make this bigger so we can see the output nicely, we see that we've got the same data frame, but it's now grouped by the classes. Now, when we run this mutate function, it's going to calculate these columns based on the data within each group. If we didn't have group_by, then we would be calculating the lag across our different groups, and that's not at all what we want. So we need this group_by(classes) before we do our lags. Now if we run this, we see that our different classes each get their own lags: they all start on August 31st of 2015, then go to September 7th, then September 14th, and so forth, okay? We can then calculate the next line with our rolling average to get that column. We can do print(n = 50) to get 50 lines of output, and we see that our different classes are intermixed here, but we're grouping by those classes and we get our rolling average for each. That works very nicely. Okay, to clean this up a little bit, and again, it's just a matter of making things look more appealing to me, I'm gonna do a select on date, classes, midpoint, and rolling_avg so that I then have those columns. One thing I do notice is that I still have the data grouped by classes; because we didn't do a summarize, it did not ungroup them. So I'm gonna add the ungroup function to the end of this to ungroup the data. It probably doesn't matter, but it makes me feel better knowing that things aren't grouped any longer, okay? I think we talked about ungroup way back when we went through the group_by/summarize idiom.
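The grouped version, ending with ungroup, can be sketched as:

```r
rolling_prices <- tidy_auction_data %>%
  group_by(classes) %>%                    # lag within each class, not across classes
  mutate(lag_0 = midpoint,
         lag_1 = lag(midpoint, n = 1, order_by = date),
         lag_2 = lag(midpoint, n = 2, order_by = date),
         lag_3 = lag(midpoint, n = 3, order_by = date),
         rolling_avg = (lag_0 + lag_1 + lag_2 + lag_3) / 4) %>%
  select(date, classes, midpoint, rolling_avg) %>%
  ungroup()                                # drop the grouping once the lags are done
```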
If you do group_by on two variables and then summarize, you still have a grouping variable, so you need ungroup to get rid of it. All right, I'm going to call this rolling_prices. We'll save that, and now we wanna plot our rolling prices. We'll do ggplot, aes, x = date, y = rolling_avg; I'm not gonna worry about plotting the midpoint data; and for color, I'm gonna use classes. Then we'll do geom_line to draw those data, and again, it kind of looks like the large prices, but for all the other classes as well. If I only want the sized classes, I can add my filter line, kind of like we did last week, where I do classes == "small", but I need two equals signs, and then the | operator, classes == "medium", classes == "large", classes == "extra large". Then I pipe that into ggplot, and sure enough, now we have four lines for our four different classes. Again, I'm not sure this is so aesthetically pleasing. We could look at geom_smooth, and maybe you could do this yourself. Let me write it down here: instead of rolling_avg, I could do midpoint, and I could do geom_smooth with span = 0.1 and se = FALSE. So again, I don't know. I kind of like the continuous line, but it is more satisfying to know what the line represents, right? The span argument is a little bit lost on me and a little bit esoteric, whereas with a four-week rolling average, I know what it represents. All right, finally, we wanna install the zoo package, look at the documentation for rollmean, and see whether we can replace all those lines we had with a single mutate function that calls rollmean. I'm gonna come back up and grab one of these blocks for the large class. I'll let you mess around with doing exercise two with rollmean, but for now we'll use the large class. I'm gonna get rid of those three mutate calls and modify this to use rollmean. I'm not sure how to use rollmean at this point.
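The filtered plot of the four sized classes might be sketched like this; the exact class labels in the data are an assumption here, so check yours with something like count(classes):

```r
rolling_prices %>%
  filter(classes == "small" | classes == "medium" |
         classes == "large" | classes == "extra large") %>%  # hypothetical labels
  ggplot(aes(x = date, y = rolling_avg, color = classes)) +
  geom_line()
```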
I know that I'm gonna have to call library(zoo), and if you're not sure whether you have zoo installed, you can go to the Packages tab and, in the finder window there, type zoo. I've got it installed. If you don't have it installed, click on the Install button, type zoo under packages, and click Install, and it'll go through the dialog of installing zoo. So I'm gonna go ahead and load zoo; I get the check mark, so it's been loaded. I'm not quite sure how to run rollmean, so I'll do ?rollmean. This shows me rolling means, maximums, medians, and sums. rollmean takes x, the vector we wanna get a rolling mean for. k, I'm gonna guess, is the number of observations in my rolling mean. There's a fill parameter; I'm not quite sure what that means. na.pad is FALSE. align can be center, left, or right. Okay, I'm gonna start simple with x and k, and we'll come back to these other arguments if I think it's necessary. We'll do rollmean, x is gonna be midpoint, and for k, what is k? The integer width of the rolling window. So I'll say four and run this. I get an error that says rolling_avg must be length 279 or one, not 276. So we've got some missing values to pad. If I remember the arguments for rollmean, I see na.pad is FALSE, so I think I wanna set na.pad = TRUE. And sure enough, that gets me my padding with NA values. But something is catching me off guard; I'm kind of stumbling here, because I now have one NA value, not three. I think what's happening is that, by default, align is "center", so it's using one of the center observations to align the rolling average. What I really want is the right. So I'm gonna do align = "right", and sure enough, I get my rolling average that starts with three weeks of NAs and then picks up with 162. And if I look at, what do I call that thing, large_prices, sorry, I get the same output, right?
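The whole lag-and-average dance then collapses into one mutate call with rollmean:

```r
library(zoo)

large_prices_zoo <- tidy_auction_data %>%
  filter(classes == "large") %>%
  mutate(rolling_avg = rollmean(midpoint,
                                k = 4,             # width of the rolling window, in weeks
                                na.pad = TRUE,     # pad with NAs so the length matches the data
                                align = "right"))  # average the current and 3 previous weeks
```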
So rollmean from zoo, if you're really interested in doing rolling averages, does the same thing as what we've done with all of our mutate lines using lag. I wanted to show you lag because it does come up in other settings where it's useful to have a lag. But again, it's nice to know that we've got a package like zoo with its rollmean function that gets us the same output. It's a lot nicer because, instead of having to change all those other lines, if I wanted a three-week rolling average, I'd change one number and I'm good to go, right? That's pretty slick and works pretty nicely. Thanks again for joining me for this week's episode of Code Club. Be sure to spend time going through the exercises on your own to help reinforce your new skills. Even better would be for you to take the data and tools you've worked with today to answer your very own question; I'd love to see what you did with rolling averages. Please feel free to drop me a line in the comments below to tell us what questions you were eager to answer. Also, please let me know what types of data analysis questions you have, and I'll do my best to answer them in a future Code Club. Be sure to tell your friends about Code Club, like this video, subscribe to the Riffomonas channel, and click on the bell so you know when the next Code Club video drops next Thursday. Keep practicing, and we'll see you next time for another edition of Code Club.