 Hey folks, I'm Pat Schloss and this is another episode of Code Club. In this series of videos, what I want to do is compare current vaccination levels against the COVID-19 virus SARS-CoV-2 to what people said a year ago was their willingness to receive the vaccine. Of course, a year ago we didn't have a vaccine and so it's kind of idle speculation. So here we are a year later. Did people get vaccinated if they said they're going to get vaccinated? Well, in the last episode, we talked about how we could go ahead and go out to find data online that we trust and that lends itself to automation and reproducibility. We found the Our World in Data website, which generously enough had a nice shiny blue button that allows us to download the data that they use on their website for a whole bunch of different variables. We're interested in those vaccination rates for each of the countries. This data file also has a bunch of different countries and daily or weekly data, I forget which, but that they didn't have like 65 different variables that they're measuring on all of these countries and all of these time points, but I don't want all 65 columns. I only want the columns related to the vaccination rates for each of the countries and for the most recent data. Well, in today's episode, we're going to see how we can use the select function to select the columns that we want out of that data frame to go ahead and make it a little bit simpler. And while we're at it, to give those column names names that are perhaps a little bit more meaningful to us. I'm over here in our studio, where I've got my very simple comparison figure dot r script. Again, very simple. We load the tidy verse so we can get all the great tools from read R from d plier from gg plot two. But today we're going to talk about the select function that comes to us again from d plier. Go ahead and run these. We now have that CSV loaded as the OID data frame. And I can do view on OID view opens up a read only spreadsheet view of the data frame. Again, this is a gigantic file. It was a 33.3 megabyte data frame, if I recall correctly, with about 127,000 rows and 65 total columns. Again, I'm not interested in all these columns, although there are some really, really good columns in here that I think would be really valuable for a lot of things. We see that for some of these early time points, there's a lot of NA values, we'll talk about that later. But we can see in here that we've got various columns with vaccinated vaccinations, things like that. So what I want to do is I want to end today's episode with a column for the location, which is the country, the date that the data was recorded. And then I also want to be able to get the number of people that were partially vaccinated, and the people that have been fully vaccinated. So basically a data frame with four different columns. How do we do that? Well, we're going to use the select function to do that. So we can take OID. And we can then pipe that into the select function. And so I might say select date. And so then this gives me a new data frame, a new table that has 127,000 rows and one column. So I can give select the name of the column that I want to get back. I could also do select minus date. Select minus date means give me back the data frame that doesn't have the date column. Looking at this, we now see that we no longer have a date column in our output. I call this negative mode, where you're telling select what you don't want versus this is the positive mode where you're telling select what you do want. So how do you get back multiple columns with the select function, right? Well, what you can do is you can combine the different column names that you want to get back with your select function. So I can do date comma location. And then I get the date and the location, right? So I've got these dates from February of 2020 from Afghanistan, right? And so again, if there's different columns that I want to get out, I can combine those with a comma. So we also want to get that vaccination data. Unfortunately, if I type O it down here, I have so many columns against their 65 columns that I get the dreaded dot dot dot at the end. So I don't know, you know, with the names of those other columns are again, if I come back up to the spreadsheet view, kind of have to scroll across here. And I think that there's so many wonky things going on. Yeah, it's the column view here gets really messed up too, right? Like I could perhaps go through here. But there's so many things going on that it's it's kind of easy to lose track of things, right? So there's people fully vaccinated, people vaccinated, total vaccinations. And I have to keep scrolling across here to get all of these. And it's not super, I don't know, I don't trust myself to write these down faithfully without introducing typos, you if you've watched these videos, you know that I'm really good at introducing typos. So another approach would be to do call names on OID. And this then outputs the 65 column names of our data frame, right? So this is really nice. I find that call names is a really useful function when I've got a whole bunch of column names like this. Again, one of the challenges with so many columns is that it might be easy to overlook the column that I'm actually interested in, right? So what do we do instead? Well, we can use the select function here, even if we don't know the actual name of the column we want. How do we do this? It's magic. No, it's not. We're going to take OID and we'll pipe it into select. And we're going to use a series of what are called helper functions. And so one helper function that we might use could be starts with right. And so just kind of looking at what we have here, I could say starts with access. And so this will give me any column that starts with the word access. And so here now I see that I've got these four columns that start with access, right? I could also then pipe this out to view so that I could get the full names of those four different columns to see what's going on there. Okay. So that starts with, and again, we put in whatever we want the column to start with, and then we'll get that back, right? So we might then say, well, what about Vax, VACC? And there's nothing there. None of the none of the columns start with Vax. We could use another helper function with select, which would be ends with. And so here we might say 100. Because I remember there were some that were per 100, right? And let's go ahead and pipe that out to view. And here we see four columns that end with per 100. So I think we're gaining on it. Let's learn one more helper function that goes with select, which is contains. And here we could put in VACC. And let's go ahead and pipe that out to view as well. And so what this is going to do is this is going to match any column that contains VACC anywhere in the name of the column. So now we get a data frame that has nine total columns, where each column name has VACC somewhere in the name, right? So we had those per 100 that we just saw earlier with ends with. But we also have all these other ones as well. So I think I'm interested in these Vax per 100. But I noticed there's vaccinations per 100, vaccinated per 100. So we could do OID and pipe that again to select. And now we can use a special helper function called matches. And matches, you can give it what's called a regular expression or a special pattern that will match. And it's a pattern, as you'll see, that doesn't look like normal English, but I'll walk you through a kind of a basic tutorial here. So I'll say I want it to match Vax and then any character. And then I want to match any character zero or more times. And then I want to end per 100. And so this is going to match this pattern, right? So it's a pattern that starts with the VACC and ends with per 100. And it might have other text in between. And we can then pipe this out to view. So now we have total vaccinations per 100, people vaccinated per 100, people fully vaccinated per 100. I'm not quite sure which ones I want though. So let's go ahead and see if we can't simplify it a little bit further. I'm going to use a filter to look at the United States data. So I'll do filter location equals equals United States. And we'll pipe that into our select. So that filter function I just showed you, I'm going to talk a lot more about it in the next episode. So please, please, please make sure that you're subscribed to the channel and you click that bell notification icon. So you know when that episode comes out. Okay, so we have a whole bunch of NAs in here. I'm going to go ahead and come to the end. And we then see total vaccinations per 100 values over 100. And so I think this means number of kind of pokes right with a needle 65.8. I know that's about 66, right, which is the percent of people in the US that are at least partially vaccinated. And then people fully vaccinated per 100 is that 56.86. And that is the fully vaccinated, obviously, as it says. So I'm going to go ahead and pipe this into call names instead of view. So I can get those different columns that I want. So I'm going to go ahead and grab those three lines. And I'll remove that pipe. And I'll go ahead and get rid of this matches function call. And I'm going to put in the actual columns I want, because the data might change over time. They might add other columns that have VACC and per 100. So I really want to be specific now about the actual columns that I want. So I'll go ahead and put in location date. And then I'll do people vaccinated per 100, as well as people fully vaccinated per 100. And I'm going to go ahead and put these on separate lines so they don't scroll off the right side. And so now what we see is that we have location dates, people vaccinated per 100, people fully vaccinated dot dot dot, but that's per 100 as well. We could also then pipe this into the tail function to see the last six rows of the data frame. And those numbers look pretty good. So great, we have taken 65 columns, and we've winnowed them down to four columns that we're interested in using the select function. And we got there using the helper functions, where we talked about starts with ends with contains and matches. And the difference between contains and matches is that matches allows you to use that regular expression that we saw up here with the VACC periods to our per 100 to match that. And while those are helpful for kind of data exploration, if I'm working with a data set like this that might be dynamic and changing over time as more data becomes available, I really do want to kind of pin it down to exactly what I want. Again, this is part of that idea of positive mode of using the select function. So one last thing that I want to share with you is actually something that I learned from somebody that reads the newsletter that I signed out every week. Are you getting my newsletters? Why don't you subscribe? I'll put a link down below in the description that you can go to. If you go to the riffmonos.org website, there's a little sign up box there and you'll get a weekly newsletter with all sorts of great tips as well as practice problems to help you to improve your art data science skills. So one of the things that he showed me is that I can actually rename the column in the select function. And so I can do that by saying the name of the new column equals the name of the old column. So it will select it and it will rename it. So I'm going to go ahead and change this people vaccinated per 100 to be all vaccinated. And then this people fully vaccinated. I'm going to call this fully vaccinated. Now I've got those simplified column names all vaccinated and fully vaccinated. I maybe even could make it a little bit shorter. I could say all Vax fully Vax. That way the column names aren't so long, whatever. You do you. And ultimately what matters is that this column name is something that's meaningful to you. Again, this people vaccinated per 100. Wasn't as clear to me as this people fully vaccinated. So people are fully vaccinated. We know what that means. These are clearly the people that the OID website is defining as people that have received at least one dose of the vaccine. And so now we've got our location, our date, our all Vax and our fully Vax. We are filtering this to the United States. So I want to remove this for now because we're going to want to look at those other countries. And if I go ahead and make my script page here a little bit bigger, there's other stuff in here that I've been practicing with. I'm going to go ahead and delete this to go ahead and combine my data frames. One of the things I talked about in the last episode is that downloading this CSV file using read CSV and even just reading it in can be kind of slow. And so one of the things I like to do is to create the data frame OID. And then as I'm going through and building out the pipeline, I might use OID as the input to that pipeline without saving it back to OID. Hopefully that makes sense. But you can see what I did here, right? We tried a bunch of different things out piping OID into those things. I settled upon what I really like. And now I am going to pipe that pipe the read CSV into the select, I'm going to remove this tail function call. And I'll save that output back to OID. So it doesn't seem like we got very far here in today's episode, we added one function to our pipeline the select function. But along the way we learned how we can pick different columns out of our data frame using a positive or negative approach, how we can get multiple columns out, how we can use those helper functions of starts with ends with contains and matches to help us to find those columns that we really want. And then of course how we can use that select function to take column names that are perhaps really long and perhaps not super meaningful to us and convert them into new names that have more meaning for us. All right, so select is often thought of as a way to filter column names or to filter data frame on columns. In the next episode we'll actually learn the filter function, which again gets confusing. Select operates on columns, filter operates on rows. We'll learn how we can use that filter function like we did here with the location equals United States to get the specific rows we want out of our data frame. Again, be sure that you've subscribed to the channel so that you know when that next episode is released. And we'll see you next time for another episode of Code Club.