Okay, so welcome to Data Wrangling, the last workshop in our series. It uses the same tutorial as the last two workshops, the intermediate and data viz workshops, so I'll keep this overview quick. As you go through the tutorial, you can click to the next topic or the previous topic, and you can navigate using the buttons on the left. At the very bottom there's usually a start over button, which resets all of the work you've done. And, as you've seen in past workshops, you can click here to get a solution if you're stuck. As we've told you before, your code might not look exactly like the solution code and still be right. Great. Any questions about the tutorial? You all know what's happening, right? Okay. So today we're going to use the tidyverse, an R package that gives us somewhat different functionality than the base R we used in intermediate. In today's workshop we'll talk about two packages, dplyr and tidyr. We'll learn to select rows and columns in a data frame, link the output of functions using the pipe operator, which is a new operator, and add columns using the mutate function. We'll use split-apply-combine to produce data summaries, using summarize, group_by, and count to split a data frame into groups of observations, apply statistics to each group, and combine the results. So we're really moving data around today. We'll talk about the concept of wide and long tables, describe what key-value pairs are, reshape a data frame from long to wide, and at the end export our data to a CSV file. We're using the term data wrangling in this workshop rather than data manipulation to highlight the idea of reproducibility.
We think of manipulating data as just moving it around, changing or deleting rows, merging files in a non-reproducible way. When we say wrangling, we're talking about documenting every step in R, using these functions, so that anybody looking at your code can reproduce exactly what you did. So, we've talked about packages in R before: they're basically big sets of additional functions you can use. We've been using functions like data.frame that come built into base R, but today, as I said, we're going to expand our repertoire with the tidyverse package. When you install the tidyverse package, it also installs tidyr, dplyr, ggplot2, tibble, and all of the other packages it relies on. The tidyverse addresses three common issues you might have in base R, so it's an improvement on base R, you might say. First, the results of base R functions sometimes depend on the type of data. Second, base R uses R expressions in a non-standard way, which can be confusing to new learners, and the tidyverse helps with that. Third, base R has some hidden arguments and default operations that can be confusing as well. For example, if you remember in Intro to R, when we imported data we had to set stringsAsFactors to FALSE. I don't know if you remember that, but that was to avoid a hidden argument that would convert our data and confuse us. In the tidyverse we don't have to do that; everything's already set for us. If you were using RStudio rather than this online tutorial, you would type install.packages("tidyverse") straight into the console, and then load it with the library command.
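The install-then-load step just described can be sketched in two lines. This is a minimal sketch, not the tutorial's own code; the install call is commented out because you only run it once per machine:

```r
# One-time install; commented out so re-running the script is safe
# install.packages("tidyverse")

# Loading the tidyverse attaches dplyr, tidyr, ggplot2, readr, tibble, and friends
library(tidyverse)
```

After this one library call, all of the functions used in this workshop are available without loading each package separately.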
So, these two packages we're using today, dplyr and tidyr. dplyr is built to work directly with data frames, and it optimizes some common tasks. It also lets you work directly with data stored in an external database, if that's something you want to do: you can run queries and only the results of the query are returned. A common problem in R is that operations happen in memory, so the amount of data you can work with is limited, but dplyr's database connections let you run queries remotely and pull back only what you need for analysis, which makes your code faster. tidyr addresses problems around reshaping data for plotting and for use by different R functions. Sometimes we want a data set with one row per measurement, but sometimes we want a data frame where each measurement type has its own column and rows are more aggregated groups, like plots. Moving back and forth between those formats can be difficult, and tidyr gives you tools to do that. That's just a tiny overview of these two; we have links in the tutorial to cheat sheets with the basic functions in each package. Okay, so today we're using some survey data: time-series data with measurements of a small mammal community in southern Arizona. It's real data collected by scientists, and it's used in the Carpentries curriculum, which our workshops are based on. We cut the data set down a little to make it easier to work with, but it still has about 11,000 observations, 11,000 rows, and the study has been going on for 40 years. So I just wanted to show you the data we're going to use today; it's basically recording the rodents that pass through plots of land.
So we have record_id for each observation; month, day, and year for when the observation happened; plot_id for each plot the scientists laid out; species_id for the type of rodent; sex for the sex of the animal; hindfoot_length to show how big it is; and weight. One change for today, since we're using the tidyverse, is that we'll use read_csv (with an underscore) instead of the read.csv you used in base R to pull the data set into our tutorial. You can see it identifying the type of each column: it says species_id and sex are characters, and record_id, month, day, year, et cetera are all doubles, which is a numeric data type. That's the column specification: when you use read_csv, it looks through the first 1000 rows of your data frame, guesses the type of data in each column, and tells you: I think these are characters, I think these are numbers. And it's right. If it's gotten it wrong, you can specify the types manually with the col_types argument to read_csv; we're not going to do that today. You can inspect the data using the glimpse command. It gives you a little overview: nine columns, the first few values in each column, and again the type of data, double or character. You can also preview the data using the View function, which is how you'd do it in RStudio; there you see it more like a spreadsheet, which might make more sense to you. In the tidyverse, the data is stored in what's called a tibble, which tweaks some of the behaviors of a data frame. You don't really need to know all the differences, but the tibble tries to be more user friendly: in addition to displaying the data type of each column under its name, it only prints the first few rows of data and as many columns as fit on the screen. So that's helpful.
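The read_csv and glimpse steps above can be sketched like this. The file, column names, and values here are a tiny stand-in for the tutorial's surveys data, which really has nine columns and about 11,000 rows:

```r
library(readr)
library(dplyr)   # provides glimpse()

# Write a tiny stand-in CSV so the example is self-contained
tmp <- tempfile(fileext = ".csv")
writeLines(c("record_id,species_id,weight",
             "1,NL,32",
             "2,DM,40"), tmp)

# read_csv() guesses column types from the first rows of the file
surveys <- read_csv(tmp)

# You can override the guesses with the col_types argument
surveys <- read_csv(tmp, col_types = cols(
  record_id  = col_double(),
  species_id = col_character(),
  weight     = col_double()
))

glimpse(surveys)   # compact overview: column names, types, first values
```

Either call produces the same tibble here; col_types just silences the guessing and pins each column's type explicitly.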
And, as we said before, columns of class character are not converted into factors. So today, Greta and Harley and Elliot are going to teach you about all of these dplyr functions and show you how to wrangle some data. So, let's get started. A question from the group: I was trying to read in my own data and I labeled it something really annoying, and then I was like, oh, I'd rather not have it be called that, and I wanted to rename the data set, but I didn't know if I was allowed to do that. I think you are. What I would do is assign it to whatever name you want with the assignment arrow; that creates a copy under the new name, and then you can use the rm function, with the old name in parentheses, to remove the original. Okay. So now that Sarah got us warmed up, thank you, we're going to start with some basic functions in the tidyverse package and work toward more complex ones. First, a common practice: if you have a data frame and you're only interested in certain variables, or you only want to explore certain variables, there's a way to select those variables and deal with just a subset of the data. To do that, and I know this comes as a big surprise, we use the select function. So this lets you select columns. The arguments in a select function: the first one is your data set name, and the rest are the variable names you want to select, separated by commas. In this first example, we want to select plot_id, species_id, and weight from the surveys data set. The data set name is always first; then we type in our variable names, our columns. Make sure I'm actually spelling these right. When I run that, it gives me just those three columns of the surveys data set. You may remember that before, to select certain variables, we had to use that dollar sign notation to index into them.
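The select call just described can be sketched as follows. The small tibble here is a toy version of the tutorial's surveys data, with made-up values:

```r
library(dplyr)

# Toy version of the surveys data frame
surveys <- tibble(
  record_id  = 1:3,
  plot_id    = c(2, 3, 2),
  species_id = c("NL", "DM", "PF"),
  weight     = c(32, 40, 7)
)

# Keep only the named columns, in the order they are named
select(surveys, plot_id, species_id, weight)

# A minus sign drops columns instead of keeping them
select(surveys, -record_id, -species_id)
```

The first call returns three columns; the second returns everything except the two dropped columns.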
This is essentially a way around that, so you don't have to use the dollar sign. Also, just a heads up: as before, we have this little solutions tab, so if you want to reference this at a later date you can come back to this code and maybe borrow some of it. I'd highly suggest you attempt the activities before looking at the solutions, but if you're stuck and need help, it's there for you. Okay. As well as selecting certain columns or variables to look at, we can also say which ones not to look at. If we want to see all of the columns except for two, in this case record_id and species_id, we can put a minus sign in front of them. So I'm going to do the same thing as before: data set first, then the variable name, but starting with a minus, so minus record_id, and the same for species_id. That gives me the data frame with those two columns excluded. Okay. Next, we can also use filter. We used the select function to choose columns we want to look at; the filter function lets us choose rows based on a condition within a variable. The arguments in the filter function are very similar: your data set name, a comma, and then a variable name with a condition. In this example, the variable we're looking at is year, and we use the double equal sign. Remember back to logicals: that asks whether it is actually equal to 1999. So we keep all of the variables, all of the columns, but only the rows where the year is equal to 1999. Just a refresher on logicals, too: we can use some of that same notation for different conditions here.
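The filter call on year can be sketched like this, again on a toy version of the surveys data with made-up values:

```r
library(dplyr)

surveys <- tibble(
  year    = c(1999, 2000, 1999),
  plot_id = c(2, 2, 7)
)

# Keep every column, but only the rows where year is exactly 1999.
# Note the double equal sign: == tests equality, = would assign.
filter(surveys, year == 1999)
```

Rows where the condition is FALSE (here, the year-2000 row) are dropped; all columns are kept.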
So we have the bang, or exclamation point, followed by equals: that's not equal to, so it essentially negates the condition. We have the vertical bar, which is or. The third one we talked about was and, and this is where it gets a little different: instead of using the ampersand, we can just use commas to separate all of the conditions we want, and, and, and. Same as last time with the less-than sign for less than and the greater-than sign for greater than. And if you want less than or equal to, or greater than or equal to, it's written exactly in the order you would read it: the less-than or greater-than sign first, then the equal sign. Walking through the examples: this one says not equal to, so it gives me all of the rows that are not 1999. See here, and if you actually take the time to scroll through all of these pages you will not find a 1999, I promise, I'll save you time. This next one is what we use for and: here I'm conditioning on two separate conditions for two separate variables, keeping the same format of variable name, logical condition, and the value I want. So I want all the observations from 1999 that also have a plot_id of 2. When I run this, all of my years are 1999 and all of my plot_ids are 2. The or example keeps a row if one or the other condition is true. So, as you'd expect, if the year is 1999 or the plot_id is 2, the row is included in this data set; you can see here we have some 1996 rows, but the plot_id is 2, so they were included. How are you feeling? Good. Everyone has the tutorial? Got the URL? Thank you. Okay, here's another example with a less than: we just want to see the rows where weight is less than 8, and I believe that's in grams. Pretty confident. Yeah. So all of the weights are less than 8. Same with greater than: we're going to look at hindfoot_length greater than 30.
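The comparison operators just walked through can be sketched side by side. The tibble is a toy stand-in for the surveys data:

```r
library(dplyr)

surveys <- tibble(
  year    = c(1999, 1996, 1999),
  plot_id = c(2, 2, 7),
  weight  = c(5, 12, 7)
)

filter(surveys, year != 1999)                 # bang-equals: not equal to
filter(surveys, year == 1999, plot_id == 2)   # comma between conditions acts as "and"
filter(surveys, year == 1999 | plot_id == 2)  # vertical bar is "or"
filter(surveys, weight < 8)                   # less than (weight is in grams)
```

In the "or" line, the 1996 row survives because its plot_id is 2, exactly as described above.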
So all our hindfoot lengths will be greater than 30. Okay, now that we got past some difficulties, we're going to talk about pipes. This may sound vaguely familiar, because when we talked about the logical operators, the vertical bar we used for or is also called a pipe, but we warned you that that's also the name of a more commonly used thing, so we usually just call that one a bar to keep them separate. Now we're going to talk about the actual pipe, the thing that's used, well, that I use more often; I don't know if it's everybody's. The pipe is essentially a way that lets us select and filter at the same time. There are three ways to do this, but we're going to show you why we like pipes the roundabout way. The first way to both select and filter is with intermediate steps. With intermediate steps, you save a temporary data frame and then do something to that temporary data frame to get your final data frame. So if you look here, we're filtering surveys first and saving it as a temporary data frame, surveys2, and then from our temporary data frame surveys2 we're selecting the variables species_id, sex, and weight, and saving that as our final data frame. When I run this, it just saves it, and then we can print out the result. When I print it out, we can see all of the weights are less than six, so it did that correctly, and we only have the variables we're looking at. The next way to do this is to nest the functions. When you nest things, you put one inside the other. R always reads these from the inside out: it does whatever is in the innermost parentheses or function first and then works its way out. Here, the innermost function is filter.
So filter is what it does first: it filters surveys, and then, because the first argument of select is the data set, right, it essentially treats that filtered result as a new data set, has it as the first argument, and from that behind-the-scenes data set we select those three variables. And we save that as surveys_sml. Let me type that again; oh, I'm very good at spelling. Then we get the same output, so it's another way to do the exact same thing. Now, nested functions are tricky, because if you're looking back at code you've nested, it's not easy to tell what R is doing first, what order things are being done in, or just what's going on at all, right? So we want something that's more concise than the intermediate steps but also easier to read than the nested functions. This is where pipes come in handy. The pipe is this funky little thing: percent sign, greater-than sign, percent sign, %>%. And there are keyboard shortcuts, so you don't have to type it every time: on a PC it's Ctrl+Shift+M, and on a Mac it's Cmd+Shift+M. Someone asked where the pipe comes from: it's in the magrittr package, yes. We talked about nesting functions, and there's essentially nesting packages too: the big umbrella is the tidyverse, within the tidyverse we have dplyr, and alongside it the magrittr package that provides the pipe. If you load the tidyverse, you essentially have access to all of those, which is really nice, and then you only have one library call instead of 50. You do have to have that loaded to be able to use the pipe operator, so R knows what you're talking about. But once you do, it makes things very easily readable, because you can do things in order. So let's look at this and kind of read our way through it.
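The three approaches described so far, intermediate steps, nesting, and the pipe, can be sketched side by side on a toy surveys tibble; all three produce the same result:

```r
library(dplyr)

surveys <- tibble(
  species_id = c("NL", "DM", "PF"),
  sex        = c("F", "M", "F"),
  weight     = c(3, 8, 5)
)

# 1. Intermediate object: easy to read, but clutters the workspace
surveys2    <- filter(surveys, weight < 6)
surveys_sml <- select(surveys2, species_id, sex, weight)

# 2. Nested calls: R evaluates from the inside out
surveys_sml <- select(filter(surveys, weight < 6), species_id, sex, weight)

# 3. Pipe: %>% feeds the left-hand result into the first argument
#    of the next function, so the code reads top to bottom
surveys_sml <- surveys %>%
  filter(weight < 6) %>%
  select(species_id, sex, weight)
```

The pipe version reads in the order the operations happen, which is why it's usually preferred for chains of more than one step.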
So essentially what the pipe does is take the thing behind it, to its left, and throw it into the first argument of the function on the right, whatever comes after it. You don't have to press enter after each pipe, but it's just for my own sanity, to look at it; it's less convoluted than one line, though you can also put it all on one line. So what this is basically doing is saying: I'm going to take this data set, and that becomes the first argument of our filter function, right? It's piping it into the first argument, and then all I have to do is put my condition there. Then it saves this behind the scenes, kind of like that intermediate data set, and pipes that intermediate data set into the first argument of our select function, where I say what variables I want. If we run this, we can see it does the same thing, in a way that's a little more concise and where it's easier to see the order of operations. And if we want to create a new object, we can: here we're giving it a name, and then to actually print it out and see what it is, we call the name. So when I run this code, the first part just saves it as surveys_sml, and then I want to see surveys_sml, so I call its name. Okay, so now we're going to get our hands dirty. Challenge one, is everyone ready? We know that one pro of using pipes is that we can do things in the order they're asked of us, more often than not. For this challenge, we're going to use pipes to subset the surveys data set to include animals collected on or after 2001 and retain only the year, sex, and weight columns. Very first thing, always: what's my data set name? What's my data set here? Surveys, awesome. Now I can use that little pipe guy, Ctrl+Shift+M on my PC, and I'm going to go down a line just to keep it organized for me, because it helps me see step by step what's going on.
Next thing I want to do: animals collected on or after the year 2001. That's choosing rows based on a certain variable with a condition, so am I going to use filter or select for that? It's a 50/50 shot. You got it: we're going to use filter. I know, I keep mixing up my voice inflection, otherwise you just read the answer off me. So we want year, and we want the ones greater than or equal to 2001, which translates to on or after 2001. That creates our intermediate data set, which I pipe into... and now is where your answer comes in: what function are we going to use next? Select. And this is where I choose what variables I want to see: we only want to see year, sex, and weight. Do you have to put year in there again? Yes, because the filter only chooses which rows you keep; to pick which columns you want to see, you need to select them too. Could you select first and then filter? In this case, yes, you could, I'm pretty sure, right? Yeah. But if you didn't keep year in the select, then the order would matter, right? Because after your select function, year would no longer be available to use for filtering. If you're retaining year, then the order doesn't matter. That's a good question. So if one of your statements depends on something that comes before it, the order does matter, but in a case like this, where you could do it either way and get the same result, you're fine to do either. You have to think: is this step dependent on previous steps? Great question, thank you for asking. Now we're going to try out a new function. It's called mutate; I love the mutate function. The mutate function is used when you want to create new variables based on existing variables already in your data set. This is very helpful for unit conversions, and that's what we're going to use for our examples with mutate.
So, to create a new variable, the first argument inside mutate is always your data set, but with the pipe we can feed that into the first argument automatically. Then we name our new variable: weight_kg is going to be the name of the new variable in our data set, and I'm going to calculate it from the existing variable weight, divided by 1000. So now it saves me a new variable. I run that code, and since we didn't save the result, it automatically prints; if you do save it and give it a name, you'll have to call the name to get it to print. We can see here it took all my weight values and divided them by 1000. Notice that for missing values it can't do the math, so it keeps those as NA. Okay, now we're going to create a second column based on the first one we created. We can actually do these both at once, which is really cool; mutate gives us a lot of power. Same idea: the first argument is the same, and you separate your arguments with commas, as always in functions. So weight_kg is calculated the same as before, and then, based on weight_kg, I multiply by 2.2 to get the weight in pounds. So I'm creating two new variables. Here the order does matter, because you have to create the kilograms column before you can do something with it. Does that make sense? It does, yeah. So when we run this code, I have two new columns; they're cut off a bit, but we have two columns here, both calculated from the weight variable. Now, sometimes when you're viewing your data set, it's kind of a pain in the butt if it prints out the whole thing, especially if you're turning in a big report, or some answers to your boss, and they want to get an idea of what the data set looks like, or you want to make sure it transformed the variable correctly.
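The two mutate steps just described can be sketched together on a toy weight column; note that the second new column is built from the first one inside the same mutate call:

```r
library(dplyr)

surveys <- tibble(weight = c(40, NA, 7))   # toy weights, in grams

# Later columns in a mutate can use columns created earlier in it
surveys %>%
  mutate(weight_kg = weight / 1000,    # grams to kilograms
         weight_lb = weight_kg * 2.2)  # kilograms to pounds
# The NA weight stays NA in both new columns:
# R cannot do arithmetic on a missing value
```

Swapping the two lines inside mutate would fail, since weight_kg would not exist yet when weight_lb tries to use it.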
So you don't need to see all of it; maybe it would be nicer to see just a little chunk of it. We have a really nice function for that, the head function, which by default gives you the first six rows of output. So now I just have the first six rows and I don't have a hundred pages of data to go through. There's also the tail function, which prints the last six rows by default. You can also specify how many rows you want: I think it's just n = 10; play with it real quick, and it'll give you the first 10 rows. Same for the last rows: you'd just replace head with tail. One thing we do notice when we print out the first however-many rows is that we have some missing values, right? And we noticed before that if a value is missing, R can't do math on it, so it just keeps it as NA. So if we want to filter out NAs and only keep rows where we actually have observations, this is one way to do it. Let's walk through what this code is doing. First we have our data set, piping it into the next thing; we feel good about that? Yeah. Okay, now we have filter, so we're selecting rows based on a condition. If we just look at is.na(weight), that identifies which rows have an NA in the weight variable. The bang in front of it negates that, and says I only want the rows that don't have NAs; that bang in front negates the thing. Then we take that temporary data set and pipe it into mutate, where I make that weight_kg variable again, the same as we've seen before. And note, if you only highlight part of the code, it's not going to run all of it. So here you can see our NA in weight was in row four; now that row's been deleted, and row five shifted up one.
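The is.na trick plus head can be sketched as one small pipeline on toy weights:

```r
library(dplyr)

surveys <- tibble(weight = c(40, NA, 7, 22))   # toy weights, in grams

surveys %>%
  filter(!is.na(weight)) %>%            # is.na() flags NAs; ! negates, keeping non-NA rows
  mutate(weight_kg = weight / 1000) %>% # safe now: no NAs left to trip over
  head(n = 3)                           # peek at just the first rows of the result
```

Because the NA row is dropped before mutate runs, every weight_kg value in the output is a real number.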
Someone asked: is there a way to take out rows with any NAs at all, to remove all NAs? Yes. For that you don't specify a certain variable; you can check the whole data set. We usually use na.omit, which removes any row that has any missing values in it. If instead you want to remove only rows where the whole row is missing, that's a little more complicated, but there are ways to do it. And if you want to remove rows where all the numeric values are missing, there are ways to do that too, but it gets more complicated. The nice thing is you don't have to list out all of the columns and check for missing values in each one. And notice that since we specified weight here, it only removed the row that had an NA in weight; we still have NAs in hindfoot_length and other variables. So now we've got a spicy challenge thrown at you early today. Are you ready? This one has a lot of steps, but we're going to work through it, and we're going to get there. You've got this. We're going to create a new data frame from the surveys data set, and give it the name surveys_hindfoot_cm. It should contain only the species_id column and a new column called hindfoot_cm, which is created from hindfoot_length. We want to retain only values of hindfoot_cm that are not NAs and are less than 3 centimeters, and then print out the head. So, we'll work through this together, but do you want to give it a shot on your own first? Okay, you've got this; I'll give you a couple minutes. Okay, I'm going to interrupt you. What's the first step you did? Some brave soul, let me know. I love it, let's do it. So we're going to use surveys, right? Let's get surveys. First step, we've got to give it a name, right?
So I'll start that out and give it the name surveys_hindfoot_cm, using that assignment arrow, which is just the less-than sign and a dash, and then surveys. I can spell it. Then what's your next thought process? Awesome. So we know we're going to eventually pipe this into something. I'm going to put a bunch of the pieces down here and then we'll rearrange them a little bit later, so just don't be surprised when everything's out of order. Next we can work with mutate. The first thing that goes in there, we know we're going to pipe in the data set, so we don't need to specify that first; then we do the new variable, hindfoot_cm. And how am I going to calculate that? I want centimeters, and it gives it to me in millimeters, so I'm going to divide that value by 10 to get centimeters. Some back and forth here: no, you do divide by 10; to go from centimeters to millimeters you multiply by 10, so millimeters to centimeters is divide. This gets us every time. I reviewed this beforehand to make sure, I promise, because I thought it was multiply too. It's not just you. If that's the part we're struggling with, I think we're doing pretty well. Okay, so it also tells us at the beginning that we want to select two columns: the species_id column and the new column we made, right? So let's do that next. We're going to use select, and we're piping in a temporary data frame, so we don't need to specify that first argument; all I have to do is list my variable names. Other things we've got to do: the first part of the third bullet point says we only want to retain values of hindfoot_cm that are not missing, not NA. To do that, we're going to use that little trick where we find the NAs and then negate it. So here we can use that filter function with is.na.
And then, what do I want to look at specifically? Actually, let's think about this for a second, because as we saw earlier, right, if there's an NA value, it's not going to be able to calculate or do any math on it. So I almost wonder: what if we put this before the mutate, to get rid of them early, so it only does math on the ones it can do math with? If we put this filter before mutate, then we condition on hindfoot_length, and it takes just the values that aren't NA, which it can actually do math on. Here we haven't created the new variable yet, so I use the name of the old one, hindfoot_length. is.na just identifies the NAs, and we negate it with that bang. Next, we have one more part of that third bullet point: it says, and are less than 3 centimeters. For this one, we actually don't have to put it in the same filter function; we only want the rows where hindfoot_cm is less than 3, so once we've created hindfoot_cm, we can add another line down here. Do we want it to filter after it selected those variables, or before? After, though I don't think in this case it would matter; you can switch the order of those two however you want. So we filter, and I want hindfoot_cm; what's my condition here? Less than 3. Perfect. Okay. So I'm going to go back through; ask your question. Don't you have to pipe every line? Exactly, great leading question; I'm so glad you can read my mind. So I'm going to go back through and add a pipe to each of these lines except the last one. Now, if we want to do this all in one big step... actually, no, we can't. We have to split it into two steps, because the very last bullet point wants us to print out the head of the data frame, but the very first thing we did was save it under a new name, right? So this, as written, is just going to save it when we run it.
If we put another pipe and add head, it's only going to save the first six rows as the new data frame. So we want to do that separately: we do head of the data frame, or, another way, I can take that data name and pipe it into the head function. So if we run this, fingers crossed... and it looks like it worked. We have our two columns that we wanted, and the value is calculated correctly. So this is one of those points you brought up earlier, where it depends on the order. Some of them do: in this case it mostly doesn't matter, but if you mutated first and then filtered, right, yes, then you would have to filter based on hindfoot_cm, the new variable. Exactly. And depending on the size of your data frame, say you had billions of rows, filtering first pares it down to a smaller set, so the rest of the pipeline is faster and doesn't take as much memory. For smaller data sets it doesn't matter as much, but with billions of rows, filter as early as you can. So now we're going to look at another package. This is called the lubridate package, and it is a lifesaver when working with dates. The reason I asked for a marker is to write today's date: you may see a lot of dates entered in one column, like this. For those of you online, I wrote 10/19/2022, because that is today's date, all in one cell under a date column. R does not like this; R throws a fit. The lubridate package makes our life so much easier and lets R be nice about dates. So, what we're going to look at first is this today example. We're going to library the lubridate package. The today function will give you today's date, and now will give you today's date with the time and time zone. If we look at how this is printed out, it gives us today's date; notice that here it uses dashes instead of the slashes I had before, and with time, it's separated by colons, plus the time zone.
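The today and now calls just mentioned can be sketched in two lines. This assumes the lubridate package is installed:

```r
library(lubridate)

today()  # today's date, printed year-month-day with dashes, e.g. 2022-10-19
now()    # the same date plus a colon-separated time and the time zone
```

today() returns a Date object and now() a date-time, which is why their printed formats differ.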
So, I did skip over these, but now we're going to bring our attention back up to these. So, we've got a ymd function, we've got a dmy. These stand for year-month-day and day-month-year. So y is always year, m is always month, d is always day. The function you want to use depends on the format your date is in currently. It's not how you want your date to be, it's how it's formatted currently. And then it's going to essentially read it in really nicely and separate those parts. Now here, when it's separating them, it needs to know what to separate on. We know ours are separated with dashes right now. So we're going to use ymd, because that's how this is formatted: year, month, day. We're giving names for the variables, like the columns we want, and then to tell it where to cut things off, we paste in that little dash as the separator. There's also a second part in this mutate function where we're creating a day of the week. We have this wday function that takes the date and actually tells you what day of the week that was, so you don't have to go searching back through calendars, which is awesome. So when we go ahead and run that, we're going to save it as a new surveys-with-days and we're going to look at the head. So this is going to give us our six rows. We scroll all the way to the side. Now we have year... oh, we already had year. We have date, month, day, and year, our new columns. So it's split that up for us very nicely, and you can double-check that it did it correctly by comparing these columns to this column. Okay. So there is one weird thing. I want to just look at the day of the week variable and get a summary of it. Looking at that, we have some weird stuff. We have Sunday, Saturday, Wednesday, Monday, Thursday, and then we have NAs. That's not what I would expect for day of the week, right? I don't know what day of the week NA is.
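The lubridate pieces mentioned here, today, now, ymd, and the date-part helpers, can be sketched like this (the date is the one from the workshop; wday's job is telling you the weekday without digging through a calendar):

```r
library(lubridate)

today()  # today's date: year-month-day, separated by dashes
now()    # the same, plus time (separated by colons) and the time zone

# ymd() is for dates that are already ordered year, month, day
d <- ymd("2022-10-19")
year(d)                # pull out the year: 2022
month(d)               # the month: 10
mday(d)                # the day of the month: 19
wday(d, label = TRUE)  # which day of the week that date was
```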
So there are some things we can do to figure out why this is happening, and that's actually your challenge three. Here we're going to figure out why they were unable to be converted. We give you a few more functions to play with too. But I think we've already spoiled it, because this last little bit walks through what it's doing. So we're taking our new data set, filtering only the rows where the date is NA, because we only want to look at those and see why it's giving us that, and then we want to select the month and day and make a table out of it. The output is formatted a little funky, but it's saying the day is the 31st, and for the month it's comparing April and September, four and nine. So it looks like we have April 31st and September 31st in some observations, and September doesn't have a 31st, there are only 30 days in that month. So this means 70 observations say April 31st, which is a made-up date. So lubridate is super, super smart. It already knows that these aren't real dates, so it flags them and makes them NAs. Isn't that crazy? It's so smart. Same with September 31st. Now let's talk about some character wrangling. So first we're going to hit you with a challenge right away. Are you ready? We're going to inspect the day of the week variable. We'll see it's an ordered factor. So we want to see what the names of the days of the week were that got taken from the date. And this is good practice always, especially when you create a new variable where you didn't specify what the levels were going to be. It's good to know what it automatically set the levels to, so you know what you're working with. So here we can use the levels function. And then all we have to put in here, we can use that dollar sign notation, surveys_days dollar sign, to index into a variable, and I want to get the day of the week.
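The "lubridate already knows these aren't real dates" behavior is easy to see on its own, outside the surveys data:

```r
library(lubridate)

# Impossible dates get flagged: lubridate warns "failed to parse" and returns NA
ymd("2000-04-31")  # April only has 30 days -> NA
ymd("2000-09-31")  # September only has 30 days -> NA
ymd("2000-04-30")  # a real date parses just fine
```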
So we can see it has the days of the week in the order most calendars use, but it's only the first-letters abbreviation of them. That's important for us to know, because if I had thought it was the whole name and I tried to search for "Monday", it would have acted like I was crazy. Right. So what if we did want the whole name of the day of the week, say this is going in a published article or something and they want the whole name of the thing. How do we do that? I get to tell you about my favorite function now, I'm very excited. This is the case_when function. It has saved my butt so many times. The case_when function is essentially a way to get you out of a bunch of if-else statements, especially when you're mutating a variable, so creating a new variable. You can essentially specify levels without having to do "if it's equal to this, then do this, else if it's equal to that..." That's a lot, right? We want to be more concise, and the case_when function is a lifesaver. I love this. So let's look at what we're doing here. We're taking our surveys-with-days, giving it a name after we've done some stuff to it. We know what the mutate function does, we've talked about how this creates new variables. But if you notice, I have the name of an existing variable here. What this is doing is actually overwriting that variable. We don't recommend this a ton, just because if you do something wrong, it's going to overwrite your data. And then if you need to go back to that, or you realize something later, it may be very difficult to get back. In this case, since we're just altering the names of levels, we're pretty safe. But in other cases, it may be better to create a new variable, like day_of_week_2 or something like that. So here we're using the case_when function, and it has specific arguments, as every function does. In case_when we're going to have a condition, a tilde, and then the output if that condition is true.
So what it does: this whole thing before the tilde, the first part here, is the condition. I'm saying if the day of the week is equal to, with that double equals sign, "Mon" for Monday, then after the tilde I'm going to say what I want to change it to, what value I want to output. Then I can put a comma and do that for every single day of the week and write them out. And then we can run this code, and with this glimpse we want to see that now these days are written out in full: Saturday, Saturday, Saturday. How are we feeling? Done? Okay. So case_when is awesome, because there are like a bajillion options you can do with it, and it saves you a bunch of time and energy. If we only want to recode a couple of levels of a certain variable, we can still use case_when without specifying behavior for every level individually. So here, same start: we're naming it, taking our surveys-with-days and creating a new variable, weekday. Weekday is conditioned on the day of the week: if it's Saturday, I'm going to set it equal to zero. Sunday, equal to zero. If the day of the week is Friday, I'm setting it as NA_real_, and that's our special way of saving something as a numeric missing value, because you can save an input as NA, but if you don't have it as the numeric NA, R won't recognize that it fits the column. Right, it's mixing types. Yeah. When you're creating a variable, you have to be consistent on the type so that it knows what type to assign, and NA_real_ is just a little bit of extra help with missing values in that case. So, here we specified the variable and our condition. And at the end, if we don't specify a variable and a condition, just TRUE, we're saying that means all other cases. So whenever the day of the week is equal to anything else, we're going to have it equal one.
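Both case_when patterns just described, recoding every level, and recoding only a few with a TRUE catch-all, can be sketched together on a toy data frame (day_of_week and the abbreviations are assumptions matching the tutorial's data):

```r
library(dplyr)

days <- tibble(day_of_week = c("Mon", "Tue", "Sat", "Sun", "Fri"))

recoded <- days %>%
  mutate(
    # condition on the left of the tilde, output value on the right
    day_full = case_when(
      day_of_week == "Mon" ~ "Monday",
      day_of_week == "Tue" ~ "Tuesday",
      day_of_week == "Fri" ~ "Friday",
      day_of_week == "Sat" ~ "Saturday",
      day_of_week == "Sun" ~ "Sunday"
    ),
    # only recode a few levels; TRUE at the end catches all other cases
    weekday = case_when(
      day_of_week == "Sat" ~ 0,
      day_of_week == "Sun" ~ 0,
      day_of_week == "Fri" ~ NA_real_,  # typed NA keeps the column numeric
      TRUE ~ 1
    )
  )
glimpse(recoded)
```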
So, the second part says: with that new data set we created, we're going to get counts of each of the weekday values. If you run it, it'll output the table. And here we can see that we got over 8,000 zeros, so weekends, and we got 329 NAs. So that's the number of Fridays we have, because that was the only value we saved as NA. Right. So, one thing that's a little tricky. I'm going to go back up here real quick. I'm not going to be able to use the levels function, I'm going to use the table function. I'm going to look at it so I can double-check that Friday has 329 observations. But notice that the days are out of order now. What order is it defaulting to? Alphabetical, exactly, because R likes the alphabet. So, what if I don't want it in that order? It always defaults to that order, so there is a way where we can actually get these back in the order we want and specify it. But first, to do that, we're going to investigate more what the data type of day of the week is. So, we're going to look at... I already did the table above, so table of that data again, saw that, and then the typeof function lets you see the data type, so we're going to typeof it. So when I run this I get that chart again, and it's telling us it's a character variable. Since in this case we have a small number of values it can take on, and they're specific values that are repeated, it makes sense to make this into a factor variable. A factor variable just has specific levels, and lets you use the levels function, so it's a little more... it's just a subset of character variables. Is that the correct way to say it? So characters are anything with letters and strings of them, and a factor variable is a specific one that has levels. So here we can do that by using mutate and then rewriting over this variable, which is not advised most of the time, but okay in this case.
And the forcats package, which is for categorical variables, not like "Go Bobcats", which I did not know when I started. It has a function called fct_relevel, so it actually lets us change this into a factor and re-level it at the same time. So it's saying: I want to overwrite this by taking this variable and putting its levels in this order. You don't have to hit enter after every level, that's just for easy readability, for our sanity, these could all be on the same line as well. So, then we glimpse it at the end, see day_of_week, and we can see seven levels, Monday, Tuesday... it starts with Monday now. Okay, now we want to verify that R put the days in the order that we specified. So, to do this, now that it's a factor variable, we can use that levels function. So levels... I'm going to copy this little bit... and glimpse to see all of them. And it's Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday, in that order. Next, we're going to look at some split-apply-combine, and Ellie is going to take over. We're going to get going. So we've gone over how to select, filter, mutate, pretty basic data wrangling stuff. So now we're going to get into a little bit more... I don't want to say intermediate, but it's kind of taking it to the next level. So the first two main functions that we're really going to be working with are group_by and summarize. With group_by, you essentially specify what variables you want to group by. So say we were grouping by sex in this example here: it'll tell us, for each sex, whatever we want to summarize, and the summarize function is really used a lot in tandem with that. So we're saying: take our edited data, group by sex, and summarize the mean weight. And then we specify essentially what mean weight is: the mean of the weight variable, getting rid of anything that has NAs, and that's our output.
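The fct_relevel step might look like this on a toy version of the day-of-week variable (the abbreviations and ordering follow what was described above):

```r
library(dplyr)
library(forcats)

days <- tibble(day_of_week = c("Wed", "Mon", "Sun", "Fri", "Tue", "Sat", "Thu"))

# fct_relevel turns the character variable into a factor and
# puts the levels in the order we list them, in one step
days <- days %>%
  mutate(day_of_week = fct_relevel(
    day_of_week,
    "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"
  ))

levels(days$day_of_week)  # Monday first, just as specified
```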
So we have females and the mean weight for them, the mean weight for males, and then the NAs. Note that these NAs are coming from sex, not from the missing data for weight. We can also group by a couple of variables at once. So here we're going to group by sex and species ID and calculate mean weight. So now for each combination of sex and species ID, we have a distinct mean weight: we've got each species for females and each species for males. And then we can see for the DM species, we are missing that sex column, so the cause of all of our problems, right? So now, rather than removing NA weights while we're calculating the mean weight, we're going to first filter them out using our filter command, right? So filtering on not-NA, so we'll only be left with rows that don't have an NA value for weight. And then grouping by sex and species ID again, and getting our mean weight. And we see that we get the exact same output, although now we don't have that missing NA for sex, because we filtered and then calculated. Yeah. And then, if we want to do something similar to what we were doing with that head function, we can specify essentially how many rows it's going to give us. So we can specify print n equals 15. And you'll see the format's a little bit different: when we did that head command, it was similar to how these were displayed, very pretty, very nice to look at. But now we're looking at the base R way to display it, which is just plain text. So we can also do a couple of summaries at once, we aren't just limited to one at a time. So say we want to remove NA weights, group by sex and species, and then do the mean weight as well as the minimum weight. If we run this code, we see that it's very similar to before, with our grouping variables on the left here.
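The filter-then-group-then-summarize pattern can be sketched on a toy data frame (the sex/species_id/weight columns mirror the tutorial's surveys data; the values are invented):

```r
library(dplyr)

surveys <- tibble(
  sex = c("F", "F", "M", "M", "M"),
  species_id = c("DM", "DM", "DM", "DO", "DO"),
  weight = c(40, 44, 42, NA, 30)
)

mean_weights <- surveys %>%
  filter(!is.na(weight)) %>%     # drop missing weights up front...
  group_by(sex, species_id) %>%  # ...then one group per sex/species combination
  summarize(mean_weight = mean(weight), .groups = "drop")

mean_weights
```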
And then we have our mean weight from before, as well as our minimum weight, which gives us a little bit better of a story from our data, which can be pretty helpful. And then if we wanted to arrange our rows based on some criteria, we can arrange by minimum weight. And note that by default, it will arrange in ascending order. But we can change that by nesting in this desc function, which will then arrange it descending. Okay. All right. So, challenge seven, you guys. Using group_by and summarize, find the mean, min, and max hindfoot length for each species, and add the number of observations. We give you a little hint here to use our little help syntax: we do question mark n, and it gives us a little bit of information about the n function. It looks a little bit prettier when you're actually in R rather than these tutorials, up in that help pane on the side. But definitely helpful, especially if you're working with functions you aren't super familiar with and you don't want to go digging online. You just do a quick question mark and the function name. Most of the time it will give you everything you need to know, sometimes a little bit extra too. I see you're all typing away, so maybe a couple more seconds here. Okay, so to get started, similar to what we were doing the last couple of slides, we want to put our data set name, so in this case we're working with surveys edited. And then we want to pipe. And then we want to do our grouping first, because we can't summarize if we haven't grouped, right? So do a group_by, we're grouping by species ID, and then we want our summarize function. And in here we can say, okay, we want to find the mean, min, and max for hindfoot length. So you can do mean_hindfoot_length, or some shorthand, and to get the mean of that, you use the mean function on the hindfoot length variable in our data set. And then, because we're doing multiple summaries...
...this is where indenting can get a little bit more frustrating. We can also do the minimum: min_hindfoot_length is equal to, again... so we have that min function. And then we can do the same for the max. Goodness. Okay, and then this last component: we want to summarize the number of observations. We'll get into a little bit of an easier way to go about this later, but we can just do n_observations equals that n function, and it will essentially tell us how many observations are in each of those groups. So we run this... surveys edited. Come on guys, you're supposed to be my eye in the sky. Surveys edited. I think it's impossible to code without a single error message. That's like when you know you've really mastered it. But we see we've got our species grouper on the left, we've got the mean, min, and max for hindfoot length, and then we've got the number of observations in each of those groups. And you can change what the columns are called, on the left-hand side of our summarize function here. I'm really lazy and I'm just going to assume everyone knows what I'm talking about, so if you have any questions, please. How are we feeling about this challenge? It's pretty cool. This is a little glimpse of the stuff you can do with dplyr, you can make it really complex. A lot of you are working on some really cool projects, so you'll definitely get the opportunity to test these out. Okay, so looking at this next challenge... I'm sorry, you had a question. Why do you rename, why do you put in the min underscore hindfoot? So, similarly to when we were doing that mutate function, we want to take this thing we're summarizing and let R know what we want to call it, since we're essentially making new columns. Yeah, that's just the column name you're specifying, you could call it Bob. It doesn't have to have min or max in it, it's just more informative if what it summarized is part of the name. So now I have Jim. Is that it for now?
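The challenge answer being built up here, several summaries plus n() in one summarize call, can be sketched with toy data (hindfoot values invented; column naming on the left-hand side of each argument is exactly the renaming being discussed):

```r
library(dplyr)

surveys <- tibble(
  species_id = c("DM", "DM", "DO", "DO", "DO"),
  hindfoot_length = c(36, 35, 33, 32, 34)
)

hf_stats <- surveys %>%
  group_by(species_id) %>%
  summarize(
    mean_hindfoot = mean(hindfoot_length),
    min_hindfoot  = min(hindfoot_length),
    max_hindfoot  = max(hindfoot_length),
    n_obs         = n()  # number of observations in each group
  )

hf_stats
```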
I think so, again... yeah, so just try it out. Yes, it's at the function. Yeah. So for any of you that didn't hear that: do you have to rename, on the left-hand side of each argument here, what you're summarizing as? No, it'll just show up as whatever your actual code looks like. But yeah, that makes it harder to work with in your variable names. If you wanted to work with it all later, you'd have to use backticks around it, and it's very complicated. So if we can avoid having parentheses in our variable names, that's better. Just make as many dots as you can, like one variable with ten dots and another with nine, make it as confusing as you can for everybody, we would appreciate that. Now, the next one. So, what was the heaviest animal measured in each year? Or, what was the max weight in each year? And we're looking at the columns year and weight. So remember, how do we choose what columns we want to look at? The select function. I'll give you guys a second to take a crack at this. I'll do some live coding, because that's really what this is all about. Alright, so, as usual, you guys get the idea by now. So we're going to start with our data set name, and pipe. Sometimes it does exactly what you just did when you hit enter. I'm pretty sure it's because I'm not doing the space, I don't know what it is. When I was working with our data set, every time I hit enter to return, it just did the same variable over and over again. I'm sure there's a way to turn that off. I never have that problem when I'm working in script files, actually coding outside of these learnr tutorials. I don't know, it's just kind of an anomaly. Hopefully they fix it. But yeah. Okay, so we want to group by year, right? We want to look at the summary, the max weight in each year. So we'll do our group_by on the year variable, and then we'll do a summarize. We can just do select... here... this is where my scratch work comes in, we do have to do a summarize.
I was going to try to be sneaky and find a different method, but probably not for the best. So you do max_weight, it's equal to the max of... very quick, so it doesn't pick up. And then we want to select the columns here, and I'm just going to do max weight, that's really what we're interested in. We want to actually be able to look at what the largest is, so we can do an arrange by... so we can see that. Oh, I didn't even find what the heaviest animal was, this was just the heaviest weight. And we could put species ID in here, if we were also interested in what that was. That must have been removed at an earlier stage, but yeah, with our tools we could change how we're displaying this data. I'm doing a pretty bad job of showing that. Any questions about this? I'm sure the solution is a little bit cleaner. Alright, so kind of like I hinted at before, where we did that summary of number of observations with just that n function: dplyr has this count function, which essentially will do that for you. So if we just do count of sex, it groups by sex and tells us how many observations are in each group by default. And it is a much easier way to go about it than doing group_by and summarize, especially if you're just specifically interested in how many observations, and not necessarily a whole set of other summaries. This was the method that we used before, and we can see that we do get the same output: one is just using this very nice count function, the other is grouping and then summarizing it. Alright, and then we also have this sort argument. So sort will just arrange it for us. Very similar to doing group_by, summarize, arrange descending, this just does it all by adding a single argument in our count. And we can do multiple variables, so we're looking at sex and species here. As we saw before when we did two separate grouping variables, same idea: grouping by sex, and each species within that sex, and then getting the total number of observations in there.
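The count shortcut just described is one line on a toy data frame (sex values invented):

```r
library(dplyr)

surveys <- tibble(sex = c("F", "F", "M", "F", "M", NA))

# count(sex) is shorthand for group_by(sex) %>% summarize(n = n())
surveys %>% count(sex)

# sort = TRUE arranges by descending count in the same step,
# like group_by %>% summarize %>% arrange(desc(n)) all at once
counts <- surveys %>% count(sex, sort = TRUE)
counts
```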
And if we want to arrange it, we can... this is pretty much what our sort argument was doing, right? So we're arranging, within each species ID, based on the number of observations. So, within each species ID, we have the most, then the second most, and then so on. It's arranging within each species ID rather than arranging the entire thing, which can be nice, especially when you're trying to keep your data a little organized: you want to have the species together, but you want to see the general trend within each species. Any questions so far? Okay, so challenge eight: how many animals were caught in each plot surveyed? I'll give you guys a second to get started on this. Oh, here, I'll give you a hint to get started: we're going to take our data and we're going to group by plot ID... I'm already forgetting the stuff I just taught you guys. We could do group_by and summarize, or we can just do count. Beautiful. Super fancy, with sort. And we can see plot 12 has the most observations, the most animals caught. So I'm going to hand it off to Greta here for relational data. So oftentimes what we'll have is information collected on observations. So we've got observation-level information, and we have information at a site level, and then we might have information at a species level. And we might have them in three different data frames or data sets. And we could, you know, copy and paste in Excel and bring it all together, or we can let R do it, so that we don't make any mistakes. So oftentimes when I'm working with people and they have multiple sources of information, I say please, I'll take care of it for you and merge it all together, so that you don't have to manually try to bring all this information together. So we have deliberately kind of masked some of the information: we just have a plot ID and we have a species ID.
We don't really know what they are, but we're going to import two other data sets that have that information in them, and we're going to join it all together. So the plots data is in a separate file. When we bring this in, we'll see that the plot IDs specify whether it's a Spectab exclosure, a control plot, a long-term krat exclosure, and so on. There are two different kinds of controls; the researchers know the difference between a control two and a control four. So there are only two columns in this data file. Then we also have a species data file that takes that two-letter species ID code and gives us the genus and the species, as well as the taxa. So now we can actually see which ones are birds and which ones are rodents, and we can get at the genus and species if we need that. How does it know those rows, or columns, variables, are linked? It doesn't yet. Oh, we're getting there. Okay. We know that they're linked, but we need to tell R how they're linked. So in each of these files, we're going to have a key and then extra information. So in the plots data table, so a plot of land, right, not like a graphical plot: which one would be the key, which one would link to our original data? You've got a choice of two. It's what we had before. So the plot ID is what we're going to use for linking. And we have this fun little quiz here, and I gave you the first one. What would the key be for the species data table? Which one of those have we seen before? We've seen species ID. So we're going to link based on species ID. And in the surveys data table, what would the key be there? It's actually the record ID, because each observation's identifier is its record ID. So another way of thinking about a key is as the identifier for the information: what makes it unique? So a primary key is one within a table that uniquely identifies an observation in its own table.
And a foreign key, this is that SQL stuff that's coming out, identifies an observation in a different table, and we want to make a relationship between these two tables. They don't have to be... generally they're many-to-one. So we'll have a smaller table with information that's shared between multiple observations, and we'll have one observation in a table that links to another table. So if plot ID goes to multiple observations, that's many-to-one. It could be one-to-one or one-to-many. We're doing some work on voter rolls: we're going to poll some people after the election this year, we're not polling people before the election. We have to link voter file information to voter history. That's one-to-one, right? One voter has one history record. But oftentimes we'll do many-to-one. So there are many observations from each plot, many observations from each species. And in the surveys data set, we have two foreign keys, plot ID and species ID, that link to two different tables. So how do we join all this together? We're going to do what's called a mutating join. It's similar to mutate, but we're actually going to give it the name of a type of join. So we have to talk about different ways that we can join data together. An inner join is where we only want the intersection between the data frames: an observation, or a key, has to occur in both data sets for us to keep it. So in this particular case, only records one and two existed in both data sets, and so we're only keeping records one and two; three and four were not shared, so they're not kept. It also keeps all columns. A left join is the most common one we use: it keeps all the records from the data set on the left, and pulls in all the information for matching keys from the data set on the right. So one, two, and three were in the left data frame, and we're only keeping one and two from the right. A right join is the opposite.
A full join keeps all records from both data sets, and fills in missings where a record wasn't observed in the other data set. So let's actually join this together. We're going to do this in two steps, because we have two data frames; order doesn't matter here. And because we have plot ID and species ID named the same in both data frames, we can just say by equals and the key, since it's named the same. So we're going to combine that. Let's do this step by step. We can take surveys edited and left-join in the plots, and now we have plot type based on that plot ID. Then we can also add in, or do it all at once, the species data, and that adds in the genus, the species, and the taxa columns. If it joins correctly, we'll only see one column for plot ID and one column for species ID. If it has a problem, it'll add a .x and a .y and keep two copies of those columns, the keys, around. If they're not named the same... this is actually pretty tricky, and I don't know why they specified it this way, but for the by equals parameter we need to combine the two keys together: a is the key from the left data set, and we need to put that in quotes; then we say equals, and b is the key in the right data set, also in quotes. If we do that, it'll all match together, so you don't have to make sure that they're named exactly the same. Most of the time I will specify that, or I'll be using this function. If you're confident that the key is the same in both, you actually don't need it; see what happens if we take that out. Same results, because we knew that we reused the key ID and we didn't have to specify it. But again, you have to really know that that's what's happening before you actually trust it. All right, I didn't have to talk much today.
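The two-step left join might look like this with toy versions of the three tables (the column names mirror the tutorial's surveys, plots, and species files; the rows are made up):

```r
library(dplyr)

surveys <- tibble(
  record_id  = 1:3,
  plot_id    = c(1, 1, 2),
  species_id = c("DM", "DO", "DM")
)
plots   <- tibble(plot_id = c(1, 2),
                  plot_type = c("Control", "Rodent Exclosure"))
species <- tibble(species_id = c("DM", "DO"),
                  genus = c("Dipodomys", "Dipodomys"),
                  taxa  = c("Rodent", "Rodent"))

combined <- surveys %>%
  left_join(plots, by = "plot_id") %>%     # keys share a name: one quoted string
  left_join(species, by = "species_id")

combined

# If the keys were named differently, you'd quote both sides, e.g.
# left_join(plots, by = c("plot_id" = "plot_identifier"))
```

Dropping `by` entirely also works when the key names match, dplyr joins on all shared column names, but you have to know that's really what is shared.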
So, similarly to how we talked about joining data, now we're going to take data and essentially change its shape. So when working with data, there are really two main forms. We have wide data, where each row represents its own unique observation, and then we have long data, where, this is how I think about it, each column represents its own variable. Is that accurate? Yeah. So two different forms, and we're going to talk about transitioning between them, going from wide to long or long to wide. So this is a helpful graphic. This is long data, it should say long up there, and this is wide data. So we can see in our wide form, each ID is its own row. And here, each variable is its own column, and we see that we have multiple rows corresponding to the same individual, essentially. So if we want to pivot into a wider format, there are a couple of arguments. There's this names_from argument, and that's the column where you're getting the names for what will become multiple different columns. So in this case, we're taking the names from genus and the values from mean weight, and this is what our transformed data will look like. So we have our plot IDs. You can see we had three rows for each individual, and now there's a separate column for each genus, with the values corresponding to that combination of individual and genus as the corresponding values. And it's important to note that these graphics were created using the full data set, not our subset, so your numbers aren't going to match if you try to reproduce this. Yeah. All right. So if we take a look here: we're taking our combined data, removing missing weights, and calculating the mean weight for each plot-genus combination, and that's our surveys_gw. So let's see our data set here. This is pretty much what we're working with. Using pivot_wider... we can see that here we have multiple rows per plot, one for each plot-genus combination.
So if we take this to the wide format, we take the names from this genus column and the values from this mean weight column, and turn it into a wider format. So here we can see that for each plot ID, we have a single row, and each genus is its own separate column, with the mean weights sitting at that combination. All right. So right into a spicy challenge. Pivot the combined data to the wide format, with year as columns, plot ID as rows, and the number of genera per plot as the values. You will need to summarize before reshaping. So we're essentially going to create this variable, number of genera. I'm going to give you guys a second to get a bit of a head start, but this is a pretty lengthy one, so I'm not going to leave too much time... again, I'm feeling generous. So here's the hint: we're looking for this new variable, the number of genera per plot. So we can start by grouping by plot ID and then summarizing the number of genera. The way we do that: so it's the genus variable, and then we can use this n_distinct, which essentially counts how many distinct observations there are. We're going to pivot this, and we can do this all in one step, similarly to how we were chaining stuff before. So we can start our pivot_wider. In this, we're going to have a couple of arguments, right? So if we think back to our diagram before: we're taking names from a column and values from another column. So here we specify that we want year as our columns. So we're thinking about what our names_from argument is going to be: if we want each column to be a year, we're going to take our column names from the year variable. The values are going to be the number of genera, right, that's what we're working with, and because we summarized it, it's essentially already in our data. So we can run this to make sure... object genera not found, column year does not exist.
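The pivot_wider step can be sketched with a toy surveys_gw (two plots, two genera, invented mean weights):

```r
library(dplyr)
library(tidyr)

surveys_gw <- tibble(
  plot_id = c(1, 1, 2, 2),
  genus = c("Dipodomys", "Onychomys", "Dipodomys", "Onychomys"),
  mean_weight = c(42, 26, 45, 24)
)

# names_from: the column whose values become the new column names
# values_from: the column whose values fill those new columns
surveys_wide <- surveys_gw %>%
  pivot_wider(names_from = genus, values_from = mean_weight)

surveys_wide  # one row per plot_id, one column per genus
```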
Audience: You need to group by both plot ID and year. Instructor: True, thanks for the catch, appreciate it. Yeah, so with that fix we can see now that each row is its own plot ID, which is what we wanted, each column is a year, and for each plot ID and year we see the number of genera in that plot. I like to skip the naming step at first, to make sure my code is doing what I want before I store it as something, but now we can store this under whatever name we find useful. Let's call it surveys_wide_genera so we never get confused, and we can just throw our glimpse command in here. We see that instead of having a whole bunch of different variables, we just have years, and then a whole bunch of information within each year, and I bet you can see why this is called wide format. It's really great for visualizing data, not necessarily great for calculating values. We'll get into that a little bit now that we're looking at some longer data. So, longer data: we talked about how each column represents its own variable. If we think about going from wide to long, instead of names_from and values_from, we have names_to and values_to. In this case, going from wide to long, we're taking the names of the columns and sending them to genus, and the values from those columns to mean_weight. The syntax is a little bit different, because we have to specify the columns that we want to use; that's kind of our first argument. So if we look at our example here, for the columns, we're looking for essentially everything that's not plot ID, so as a shortcut we can just do minus plot_id to specify everything but that. Then we want to take the names of those columns and put them into genus, and the values that were underneath those columns and put them into mean_weight. You can see now we have a plot ID column with repeating IDs, essentially multiple rows per combination, and then we have a genus column, which has all the genera, and the mean weights.
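The corrected challenge solution might look something like this sketch, again using a tiny invented data frame in place of the tutorial's combined data:

```r
library(dplyr)
library(tidyr)

# Toy version of the combined data; the real data set spans
# many more plots, years, and genera.
surveys <- tibble::tibble(
  plot_id = c(1, 1, 1, 2, 2, 2),
  year    = c(1996, 1996, 1997, 1996, 1997, 1997),
  genus   = c("Dipodomys", "Onychomys", "Dipodomys",
              "Dipodomys", "Dipodomys", "Chaetodipus")
)

# Group by BOTH plot_id and year, count distinct genera,
# then spread the years out into their own columns.
surveys_wide_genera <- surveys %>%
  group_by(plot_id, year) %>%
  summarize(n_genera = n_distinct(genus), .groups = "drop") %>%
  pivot_wider(names_from = year, values_from = n_genera)
```

Each row is now one plot, and each year column holds the number of distinct genera seen in that plot that year.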
So here we can essentially do the same thing, but with slightly different syntax: instead of saying "not this column," we're selecting a set of columns using range notation, something colon something else, which selects everything between those two columns. If we think about our columns here, we have a whole bunch of columns where each column's name is a genus. So written this way, we're selecting those columns, sending their names to the genus column that we're creating, and their values to the mean_weight column. So now we're back in that long form. All right, any questions so far? No? I hope that means I'm doing a good job and that everyone's not completely confused. Okay, so: take surveys_wide_genera and use pivot_longer to pivot it into the long format it was in before, so that each row is a unique plot ID by year combination. Just to help with selecting our columns, if we print the names in front of us, we can see what we're working with. We'll give you guys a head start here; not going to give you the answer. Think about where we started: we want to work with our surveys_wide_genera data set, and then we're going to do our pivot_longer function. For the columns, looking at our names, if we're selecting all of our year columns we can do 1996 to 2002, and because those names start with numbers, they both have to be in backticks. Or another way to do this is to set the columns to minus plot_id, right, to specify that we just want everything but plot ID. Then we want to send the names to... well, where do we want the names of these year columns to go? We're going to have to make a new column, right? So we want a new column called year, where the name of each wide column goes. And then the values that were in those columns, the information contained under each of those columns...
...well, if we think back to what we were working with, and as hinted at by the data set name here, each of these values represents the number of genera in that year. So we can send those values to a column called n_genera. If we look at our data set now, for each plot ID and year we see the number of genera. And the reason this would be more beneficial than a wide data set is that if I wanted to calculate, say, the average number of genera by year, it's now a little bit easier to group. I guess that specific example would be easy either way, but you get the general picture. Okay, so any questions about that challenge before we move on? Audience: Why are there backticks around the variable names? Instructor: If a variable name starts with a number or has spaces in it, it needs backticks around it, otherwise the code won't work. Trying to save time, I generally just use backticks. Because if you were to do surveys_wide_genera, dollar sign, and then select 1996 from the autocomplete list... Audience: Oh, I thought it changed that. Instructor: In normal R, when you do that, it puts backticks around it for you. So usually with a dollar sign and then the variable name, RStudio adds the backticks; in this particular case the tidyverse treats them the same. Thank you. All right, so looking at the next challenge here: taking our combined data, we've got two measurement columns, hindfoot length and weight, and it would be nice if we could have a single column that's just measurement, and then another column that has the values corresponding to each of those. Let's glimpse the combined data set just so we can get an idea of what that looks like. So our goal is to essentially combine weight and hindfoot length into a single variable...
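A sketch of that reverse pivot, starting from a small hypothetical wide table like the surveys_wide_genera we just made:

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one row per plot, one column per year.
surveys_wide_genera <- tibble::tibble(
  plot_id = c(1, 2),
  `1996`  = c(2, 1),
  `1997`  = c(1, 2)
)

# Numeric column names need backticks; cols = -plot_id
# ("everything except plot_id") would select the same columns.
surveys_long <- surveys_wide_genera %>%
  pivot_longer(cols = `1996`:`1997`,
               names_to = "year",
               values_to = "n_genera")
```

Note that the year column comes back as character, since it was built from column names; you'd convert it with `as.numeric()` if you needed numbers.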
...essentially ending up with two times as many rows as we have now. So I'll give you guys a sec to get a head start, though at the rate that I code it would probably take three times as long. Okay, so we're going to do our pivot_longer, and the columns that we're interested in, because they're right next to each other, we can do hindfoot_length colon weight. Then we want the names of those to go to this new column, measurement, and the values from those to go to a column called value. Think about what this is going to look like. Looking at our new variable, instead of having two columns, we now have a single measurement column, and we have a value corresponding to each of those. So now that we know this is doing what we wanted, we can go back and store this as combined_longer. Any questions about that? All right, so looking at this last one here: with this new combined_longer data set, calculate the average of each measurement in each year for each different plot type. If you think about what we're going to be starting with, we're going to take our combined_longer data, and then we're going to be grouping by a couple of things here, right? We're looking for the average of each measurement in each year for each different plot type, so we're going to be grouping by three things: measurement, year, and plot type. And then for each of these groups, we're going to be summarizing. Yes, so summarize the mean measurement. If we remember, the values corresponding to each measurement were called value before, so you just do mean of value. And then we're going to pivot these summaries into a data set with a column for hindfoot length. Let me take a second here, I don't remember whether the column name was capitalized... good. Okay, so if we look at what we've got going on right now, we have this situation, right, and our goal is to have a data set that has a column for hindfoot length and weight, so essentially undoing what we did before.
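The steps so far, the pivot into measurement/value pairs and the grouped summary, might look like this sketch on a small made-up data frame:

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the combined data's two measurement columns.
combined <- tibble::tibble(
  year            = c(1996, 1996, 1997, 1997),
  plot_type       = c("Control", "Control", "Control", "Control"),
  hindfoot_length = c(32, 34, 30, 36),
  weight          = c(40, 44, 38, 46)
)

# Stack the two measurement columns into measurement/value pairs,
# doubling the number of rows.
combined_longer <- combined %>%
  pivot_longer(cols = hindfoot_length:weight,
               names_to = "measurement",
               values_to = "value")

# Average each measurement for every year/plot_type group.
summary_long <- combined_longer %>%
  group_by(measurement, year, plot_type) %>%
  summarize(mean_value = mean(value, na.rm = TRUE), .groups = "drop")
```

Each row of `summary_long` is one measurement/year/plot_type combination, which is the long table we're about to pivot wider.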
So we're going to be pivoting to a wider format. We do our names_from, which is our measurement variable; that's where we have hindfoot_length and weight, so we're essentially taking those two names and separating them into their own columns. And then values_from, we want to use mean_value, which is the summary that we just calculated. We can see now we have a similar summary, but instead of having a measurement column and an average value column, we now have year, plot type, hindfoot length, and weight, and the values under each of those columns correspond to the average values for their respective measurement. I think hindfoot length was in millimeters and weight was in grams. Any questions about that? Do we want to go through the additional practice? Oh, right. All right, so exporting data is our next topic. We talked a little bit in one of our workshops about how we can take data that we've essentially edited and write it back out as its own file, so we're going to go a little bit more into the specifics of that. The first thing we want to talk about is that if you don't have a separate data subfolder where your script file is, running this code essentially creates one. So if you mention a path like "data" and then a forward slash, it'll write into the folder that was just created. So say we're working with some data, and we've made this surveys_complete data set, really nice, no missing values, great. We do a little bit more wrangling: we count the total number of observations for each species ID and filter out anything with fewer than 50. So here we're filtering on species ID being within the species IDs of this species_counts data set, essentially taking the species IDs that had counts of at least 50 and keeping just the rows for those species. It's kind of a roundabout way of doing this.
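That count-then-filter pattern might look like this sketch, with invented species IDs and weights standing in for the real surveys_complete data:

```r
library(dplyr)

# Toy stand-in for surveys_complete (no missing values):
# one common species and one rarely observed species.
surveys_complete <- tibble::tibble(
  species_id = c(rep("DM", 60), rep("PX", 3)),
  weight     = c(rep(43, 60), rep(20, 3))
)

# Count observations per species, keeping only species
# observed at least 50 times.
species_counts <- surveys_complete %>%
  count(species_id) %>%
  filter(n >= 50)

# Keep only the rows whose species_id is in that well-sampled list.
surveys_complete_subset <- surveys_complete %>%
  filter(species_id %in% species_counts$species_id)
```

The `%in%` test is what makes this a row filter: each row is kept only if its species ID appears in the filtered counts table.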
So then our new surveys_complete_subset is this really nice data frame that we just made, and we can run this write_csv command. It'll take whatever data set we mention here and write it to this path. We have that data folder that was made by the function above, and we store the file as surveys_complete_subset.csv. This is where you would change the name of it if you wanted it to be different from what it was called in your R environment. So, yeah, pretty nice, especially if you want to have one script file that does everything: reading in the old data, all of the wrangling, and then writing it out, rather than having to go through Excel. I don't really like Excel because it's a little bit of a pain to work with; this makes it so you can do all of it in one place. Now, do we want to go through that additional practice? We're kind of short on time, but they can do the additional practice. Yeah, so we have one more slide here; it's a challenge if you guys want to take a look at it. If not, no worries, but we do have a solution there if you want to take a look.
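Putting the folder creation and the export together, a minimal sketch (with a tiny hypothetical data frame in place of the real surveys_complete_subset) might look like:

```r
library(readr)

# Hypothetical cleaned data frame to export.
surveys_complete_subset <- tibble::tibble(
  species_id = c("DM", "DM"),
  weight     = c(42, 45)
)

# Create a data/ folder next to the script if it doesn't exist yet;
# showWarnings = FALSE silences the warning when it already exists.
dir.create("data", showWarnings = FALSE)

# Write the data frame out as a CSV inside that folder. The file name
# here is where you'd rename the output if you wanted it to differ
# from the object name in your R environment.
write_csv(surveys_complete_subset, file = "data/surveys_complete_subset.csv")
```

Because the whole pipeline lives in one script, anyone can rerun it from raw data to exported CSV, which is the reproducibility point of "wrangling" over hand-editing in Excel.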