If you're like me, when you first learned the group_by function from dplyr, you learned it alongside the summarize function. It almost seemed like these two functions couldn't be separated from each other. It was kind of on a whim one day that I said, huh, I wonder if I could group my data by some variable and then operate on all of those data together. Say I wanted to calculate a relative abundance: I had a whole bunch of counts from the same group, I wanted to calculate the total, and then I wanted to divide each count by that total. Would it work? Of course it did. And that's one of the powerful ways I've used group_by in my day-to-day work.

As we march through these Code Club videos, I've been trying to link COVID vaccination rates from 15 countries back to data published by Ipsos for those same 15 countries, from back in October of 2020, indicating whether people in those countries said they would receive the COVID vaccine. So we're trying to link data that's a year apart from two different sources, and it's not easy. The group_by function is one of the tools we're going to use to get the most recent vaccination data for each of our 15 countries. For eight of them, we have data from the same date, say October 28 of 2021. But the other seven countries might have data from yesterday, or the day before, a couple of days earlier than that. I don't really care about a couple of days' difference, but I want those data to be treated in parallel with the data from October 28. So the United States has data from October 28, but Brazil's most recent data might be from a day or two before that. How do I bring those together so that I get each country's vaccination rate for the last day we have it? If this isn't making sense, let's head over to RStudio and hopefully I can clarify things for you. Here we are with the code we've been working through in the last several episodes.
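As a quick sketch of that relative-abundance idea, here's a minimal example of group_by followed by mutate instead of summarize. The toy data and column names are made up for illustration; they aren't from the video.

```r
library(dplyr)

# Toy data: counts of taxa observed in two samples (hypothetical values)
counts <- tibble(
  sample = c("A", "A", "A", "B", "B"),
  taxon  = c("x", "y", "z", "x", "y"),
  count  = c(10, 30, 60, 25, 75)
)

# Group by sample, then mutate within each group: each count is divided by
# that sample's total. Unlike summarize, this keeps all the original rows.
counts %>%
  group_by(sample) %>%
  mutate(rel_abund = count / sum(count)) %>%
  ungroup()
```

Because sum(count) is evaluated within each group, sample A's counts are divided by 100 and sample B's by 100, each using its own total.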
I'll go ahead and run this. It loads the tidyverse, and it gets the CSV data from Our World in Data; I'll call this owid. It's a big file, but read_csv works great for reading directly from the URL. It selects the four columns that we want: the location, the date, the percent of people who have received at least one dose of the vaccine (all_vax), and the percent of people who have been fully vaccinated (fully_vax). And we have data for every day going back to February of 2020, when this whole debacle started.

If we look at owid, we see that we have Brazil from October 27 of 2021, so yesterday or two days ago as I'm recording this. In contrast, if I remove these two lines and uncomment these lines, I'm looking at the 15 different countries in my study for October 28. Let's go ahead and remove the drop_na, and what I see is that of my 15 countries, seven of them don't have data for October 28. Like we saw, Brazil has data from October 27. So how do I get the fully_vax and all_vax data from Brazil for October 27? We're going to use the group_by function.

In the last episode, I showed you how we could do this using slice_max. What I want to do is group my data by country and then use slice_max to get the row with the largest date, that is, the maximum date for each country. As we work through this, I'm going to read the data frame in once as owid so I'm not constantly downloading it; I'm sure their server doesn't appreciate that, and it just takes time. So I'll read owid in as this data frame, selecting those four columns, and we have about 127,000 rows and four columns, starting with Afghanistan. I'm going to come down to the filter statement I had earlier and pipe owid into that filter. I'm also going to leave out this date.
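The loading step looks roughly like this. The exact URL and the original Our World in Data column names are my reconstruction of what's on screen, not verbatim from the video:

```r
library(tidyverse)

# Our World in Data vaccination file; URL is an assumption for illustration
owid <- read_csv("https://covid.ourworldindata.org/data/vaccinations/vaccinations.csv") %>%
  select(location, date,
         all_vax = people_vaccinated_per_hundred,          # at least one dose
         fully_vax = people_fully_vaccinated_per_hundred)  # fully vaccinated
```

Assigning the result once means the pipeline below can start from `owid` without re-downloading the file on every run.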
So now what we get out are 9,610 rows across our countries. I could pipe this to count(location) to see how many rows we have for each of the 15 different countries. Sometimes, if you don't get data in the output, it could be because we spelled something wrong in the filter. That's not the problem here: we have all 15 countries, they're all spelled correctly, and they all have rows. We need to get the final time point where we have vaccine data.

To do that, as I mentioned, we can use group_by to group our data by location. If we look at the output now, it looks the same, except at the very top it says, commented out, "Groups: location [15]". That tells us that under the hood, R has taken our owid data, this whole pipeline, and segmented it by location. You can think of it as one big data frame, but R is kind of winking and saying, well, you've really got 15 different sub-tables here. What we're going to do, then, is run an operation on each of those sub-tables.

So we can pipe that grouped data into slice_max. I'm going to tell slice_max that I want the three largest values of date, so I'll say date, and then n = 3. We get back the three most recent rows for each of the 15 countries. And again, that's the beauty of group_by without summarize. When you summarize, you take each group of your data and condense it down to a single row of a new data frame. Whereas if you don't summarize, the data stay grouped by whatever the grouping variable was, and we do an operation on each group. We basically ran slice_max on each value of location; you can think of it as running slice_max 15 different times. All right, that's pretty cool, right? Now, though, I'm not interested in the maximum date where I have an NA value.
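That grouped slice_max step could be sketched like this, assuming `owid` is the data frame read in above. I've abbreviated the country list for illustration; the real filter names all 15:

```r
owid %>%
  filter(location %in% c("Brazil", "Canada", "China", "United States")) %>%
  group_by(location) %>%      # segment into one sub-table per country
  slice_max(date, n = 3)      # three most recent rows within each country
```

Without the group_by, slice_max would return the three largest dates in the whole table; with it, the "top three" is computed separately for every country.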
So I want to drop_na, but I want to be careful about where I drop those NAs: I'm going to do it before the group_by. So I'll do drop_na, and I can say fully_vax, and pipe that into my group_by and slice_max. And now I see that I have Brazil on the 27th, Canada on the 28th, China on the 23rd, and so on. Now, I don't need the three most recent dates with data, I just need the most recent, so I can go ahead and change that three to a one. And now I see my 15 different countries and the most recent date with fully_vax data for each. China goes back to October 23; I think that's the oldest. That's five days previous, which isn't that old. Although its all_vax value is NA, and that's a little bit disturbing. The all_vax column, again, is the percent of people who have received any vaccine, who have been jabbed at least once. I'm not totally sure I want that all_vax data, but let's see how we could do it if we wanted all_vax as well.

We could go into drop_na and do all_vax comma fully_vax; that way we'd remove any row where all_vax or fully_vax was an NA value. Now that we've added all_vax alongside fully_vax in the drop_na, we see that China goes back to September 18, more than a month ago, to get a value for both all_vax and fully_vax. I suppose if we were trying to be more sophisticated, we could use the more recent value of fully_vax and that older value of all_vax, but I'm not interested in getting that complicated in this analysis. In the end, I suspect my analysis will end up using fully_vax, so for the time being, I'm going to go back to using drop_na on only fully_vax. I'll go ahead and remove that all_vax, and again, this is what we get. I now need to do two things to clean this up. First, I need to get rid of that grouping.
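The point about placing drop_na before the grouping could be sketched as follows; the `countries` vector is a hypothetical name holding the 15 country names, and the column names are my reconstruction:

```r
owid %>%
  filter(location %in% countries) %>%
  drop_na(fully_vax) %>%     # drop NA rows first, so slice_max can't return one
  group_by(location) %>%
  slice_max(date, n = 1)     # single most recent row with fully_vax data
```

If the drop_na came after slice_max instead, a country whose latest row had an NA fully_vax would be dropped entirely rather than falling back to its most recent row that does have a value.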
In my experience, unless you know what you're doing with it, the grouping needs to go away, because a lingering grouping can cause problems down the road when functions start working within each group rather than across the whole data frame. For example, if I wanted to calculate an average across all of these rows, I wouldn't get it: because the data are grouped by location, I'd get an average within each location instead. So I need to remove that grouping. I can do that by coming in after slice_max and adding ungroup. Ungroup removes that grouping variable, and now we see that our table no longer has the grouping.

The other thing I want to do is get rid of that date column. We've talked about this in the select episode: we can do select and then minus date. That's the negative configuration, or the negative approach, to using select. We can run that, and now we have a simple data frame with the location, all_vax, and fully_vax columns. We don't need the date, because we're confident these are the most recent reportings of the percent of people who have received at least one jab or are fully vaccinated.

Let's tie this together with the rest of our pipeline now. I'll go ahead and delete this, put a pipe here, and we'll see that our owid data frame is now what we expect. In this episode, we saw the beauty of being able to use group_by without the summarize function. I think it's really cool that we can group our data by some variable and do an operation within each set of data, an operation that may or may not reduce the number of rows. What we did today did reduce the number of rows, but it didn't have to. And then we could ungroup and proceed with the rest of our analysis. So what we did here is group the data by country, then use slice_max to get the row with the maximum date for each of those 15 groupings.
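Put together, the whole cleanup pipeline is roughly the following. The `countries` vector and the all_vax/fully_vax column names are my reconstruction of what's on screen, not verbatim from the video:

```r
owid <- owid %>%
  filter(location %in% countries) %>%  # keep the 15 study countries
  drop_na(fully_vax) %>%               # remove NA rows before grouping
  group_by(location) %>%
  slice_max(date, n = 1) %>%           # most recent row per country
  ungroup() %>%                        # drop grouping so later steps see one table
  select(-date)                        # date no longer needed
```

The ungroup and select(-date) at the end leave a plain three-column data frame, ready to be joined against the Ipsos data in the next episode.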
By using that drop_na function strategically, we could remove those rows to make sure we're getting the most recent data. And as we saw with China, we have to go back a little more than a month to get both partially vaccinated and fully vaccinated data. Again, I'm not totally sure that I want that partially vaccinated data, but we'll come back to it in the future if we need to.

All right, hopefully you can see where we're going here. We now have a cleaned-up data frame, something we probably could have put together by hand by looking at the website or googling around. But ultimately, this is far more robust, reproducible, and automated, because I can come back in two months, rerun the analysis, and get updated values for these 15 countries just by running the script. It takes a few seconds, versus me most likely screwing things up by going back and forth with the Our World in Data website. In the next episode, we'll talk about how we can join data frames together, so we can combine what people said they would do with what they actually did. And what I want you to do, besides getting vaccinated (so get vaccinated if you haven't), is subscribe to this channel and click the bell icon so you know when that next episode is released. We'll see you next time for another episode of Code Club.