Hey folks, I'm Pat Schloss and this is Code Club. If you've been following along, you know I've spent way too much time talking about data from a study conducted by Ipsos looking at people's willingness to receive the COVID-19 vaccine. Well, here we are a year later, and people are actually now receiving the COVID-19 vaccine. Have you gotten yours? I have, and I really hope you have as well. Anyway, we can now ask: did those people actually receive the vaccine, like they said they would? What I want to do is build toward a figure where on the x-axis I can put people's intention to receive the COVID vaccine back in October of 2020, and on the y-axis the percentage of people who, a year or so later, have actually received the vaccine, for these 15 different countries. We have found a data set from Our World in Data that is just so rich; it has a lot of different COVID information for each country at different time points, and it covers far more than just our 15 countries, of course. Instead of jumping straight to the final figure I want to build, I'm trying to take it slow and show you how we can use a variety of other functions from the tidyverse, specifically from dplyr, to manipulate a data frame and get it ready to join with the data we have from Ipsos. In the last episode, we took about 65 columns and winnowed them down to four, and to do that we used the select function. Now we want to go orthogonal to that: instead of removing columns, we now want to remove rows. To remove rows from a data frame in R, we're going to use a function called filter. You've seen me use the filter function here and there, but I haven't shown you how to use it to customize a search, so to speak, to return the rows we want, and how that feeds into the rest of our pipeline.
So I've gone ahead and loaded the data from Our World in Data and run the select function that we wrote in the last episode. The next thing we want to do is work with the filter function. Again, if we take owid, it's this big monster of a data set: it has 127,000 rows, but thanks to select, it now has only four columns. I only want the 15 countries that were in the Ipsos study, as well as the most recent data. To figure out the most recent date, I can pipe this into tail and see that there's data, at least for Zimbabwe, up to October 28 of 2021. This data set gets refreshed, it looks like, every day, so tomorrow there will be a new version and so forth. So let's work with the October 28 data, and as we go forward, we might find that we need to revisit that. All right, I want the data from every country on October 28. To do that, obviously, we're going to use the filter function. So we'll take owid and pipe it to filter. The argument to filter is an expression that evaluates to true or false, a logical value. So I could say date == "2021-10-28": I want the rows where the date column has values equal to October 28, 2021. Notice the special equals sign here. The double equals is a logical equals that basically asks, are these two values the same, true or false? If the argument to filter evaluates to true, then that row is returned in the new data frame; if it's false, the row is rejected. If we run this, we see that at least the first 10 rows are all from October 28, which is what we want. Alternatively, we might use != instead. That exclamation point means give me everything except the data from October 28. So if I pipe this to tail, I now see that I no longer have the October 28 data from Zimbabwe, because we've asked for everything that doesn't equal that date.
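In code, those two filters might look like the following. This is just a sketch: it assumes the data frame is named owid and has the location, date, and fully_vax columns kept by the earlier select step (the exact names may differ in your own script).

```r
library(tidyverse)

# Keep only the rows from October 28, 2021; == asks "are these equal?"
owid %>%
  filter(date == "2021-10-28")

# != flips that: keep every row whose date is NOT October 28, 2021
owid %>%
  filter(date != "2021-10-28") %>%
  tail()
```

Comparing a Date column to a string like "2021-10-28" works here because R converts the string to a Date before testing equality.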
But obviously, I do want the data from October 28. Again, this filter function returns the data from October 28, 2021. I could pipe that into another filter, where I could say something like filter(fully_vax > 50). What this should give me is, of the rows where the date was October 28, only those rows where the fully_vax value is greater than 50. (That's fully_vax, not full_vax.) And now we have 40 countries that are more than 50% fully vaccinated. If I wanted less than 50%, I could use the less-than sign, and we see there are 37 countries that have data for October 28 and a fully_vax value below 50%. So we can make comparisons on character data using things like ==, as well as on numeric data using greater than, less than, greater than or equal to, less than or equal to, and so on. There are a lot of different operators that return a true or false value. For the time being, I'm going to go ahead and remove that second filter and try another one: filter(location == "United States"). I think we did this in the last episode, where we were kind of just trying to simplify this massive data frame down to the data for the United States. So we now have two searches, one for the date 2021-10-28 and one for the United States. One question we might ask is, how do I get the US data on October 28? I just showed you a way to do this with the fully_vax column, and we can do something very similar: take the first filter and pipe it into the second. Basically, we're taking the owid data frame, those 127,000 rows, and filtering to get only the rows from October 28.
And then, with that data frame, we can say: give us the rows where the location is the United States. What we get back is that one row for the United States. So this is very convenient, but it's a bit verbose; there's a little too much going on here. We could simplify it into a single filter call. So let's take owid again, and I'll go ahead and grab this filter statement and add to it. One thing we could do is put a comma and then location == "United States". That comma acts like an "and", so it returns the exact same result that we had before. Filter so that this is true and this is true: the date equals 2021-10-28 and the location equals the United States. If both of those are true, then we return that row. If either value is false, then we do not return that row from the data frame, which again is why we get one row back. I'm not a big fan of using the comma, though. What I like to do instead is replace the comma with the ampersand; the ampersand says "and". The reason I like it is that we can very easily build out more complicated searches, or filter arguments, to return the things we're interested in, and we get the same result if we use date & location. So that's an "and" statement. What about "or" statements? Maybe I want multiple locations. We can again start from owid and grab the filter statement where I did location == "United States"; you'll recall we got a whole bunch of rows there. I can build out an "or" using a vertical line. This character is often called a pipe, but we all know that in R the %>% operator is also called a pipe, so the terminology is a little confusing. Here, the vertical bar indicates an "or". So I could add location == "Canada".
I could add location == "South Korea", and then, say, location == "Brazil". This gives us about 2,500 rows, and we can't really see anything, so at the end of this I'm going to add a count statement: count(location). Now we see that we have 611 rows, or time points, for Brazil, 642 for Canada, and similar counts for South Korea and the United States. We could keep building out this long chain of "or" statements for all 15 of our countries, but let's go back to just the United States and Canada for now. That "or" means: if the location is the United States, we'll return the row, or if the location is Canada, we'll return that row. A location can't be both the United States and Canada, so an "or" is what we want here. If either of these statements is true, we get that row; if both statements are false, we don't. Clearly both statements can't be true at once in this case, but you could imagine creating filters with an "or" where the left side and the right side are both true. So this is what we get for Canada and the United States: about 642 and 646 time points for those two countries. Now, though, let's say we want the United States and Canada data for October 28. We might be tempted to add that ampersand and then say date = "2021-10-28". (By the way, that's the ISO standard notation for dates: year-month-day, not month-day-year or day-month-year.) And I just made a common mistake that I frequently make; this was not intended, it just happens. We see "Error: unexpected '=' in ...". What that means is that I actually used a single equals sign, an assignment equals, like you're assigning a value to an argument.
What I really meant was the logical equals, the double equals, with date. Now I get output, and I no longer get that error message. You'll often see that error if you use a single equals sign rather than a double when you're trying to do some type of logical comparison. But the results we get are kind of funny: we get one time point for Canada, where before it was 642, I think, and we get 646 for the United States. So what's going on? Well, if we look at our query, the ampersand binds more tightly than the vertical bar, much like multiplication before addition, so R reads this as: the location equals the United States, or (the location equals Canada and the date equals 2021-10-28). It's requiring this to be true, or this pair to be true, when what we really want is: the location "or" to be true, and the date to be true. How do we ensure that we're expressing our query correctly? We can use parentheses. We wrap the two location "or" statements together in parentheses, and then say the contents of the parentheses have to be true and the date has to be true as well. Now we get one row back for each country. So let's go ahead and remove that count and see what we got. We have Canada and the United States for October 28, 2021, and Canada has more people fully vaccinated than the United States. The important thing to keep in mind with our logical questions is that there is an order of operations, just like PEMDAS in arithmetic, and the P is parentheses. If you want to make sure you're comparing the right things to each other, put them in parentheses: whatever is inside parentheses is evaluated first, and then we compare between sets of parentheses.
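Putting the "and", "or", and parentheses pieces together, the contrast looks like this; again a sketch using the assumed owid, location, date, and fully_vax names:

```r
library(tidyverse)

# & means AND: both conditions must be TRUE for a row to be kept
owid %>%
  filter(location == "United States" & date == "2021-10-28")

# | means OR, but & binds more tightly than |, so this asks for
# "United States" OR ("Canada" AND October 28): every US time point,
# plus a single Canadian row
owid %>%
  filter(location == "United States" |
           location == "Canada" & date == "2021-10-28")

# Parentheses force the intended grouping:
# ("United States" OR "Canada") AND October 28 -- one row per country
owid %>%
  filter((location == "United States" | location == "Canada") &
           date == "2021-10-28")
```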
So we could build out this filter statement with the other countries, separating all 15 locations with those vertical lines and then capping it with that date. I'm going to hold off on doing that, though, because we're going to see how we can do it using a join in a future episode. For now, I'm content with these two examples, location United States and location Canada. So again, it's important to remember that the arguments to filter have to evaluate to a logical value: what we put in there evaluates to true or false. And that can be kind of cool if you think differently about your data. When we look at our owid data, we have character data, date data, and doubles. Well, we could also have logical data. So I could take owid and pipe it to a mutate: above_half = fully_vax > 50. What happens here is that fully_vax > 50 is a logical; it returns true or false. So I'll have a column called above_half that is true or false, depending on whether fully_vax is greater than 50. Again, I'll go ahead and do a tail on this, and we see that Zimbabwe has values less than 50% and those rows are false. Of course, we could also then pipe this to filter(location == "United States"), and let's add the tail there as well. What we see is that for the last six time points, between October 23 and October 28, fully_vax was greater than 50, and so above_half is true. Now, if I wanted to, I could come back to this pipeline and, instead of location == "United States", just say above_half. That's all, just the column name. What we'll get back are all the rows from owid where above_half is true, and we see there are 4,389 rows from countries that are above half. We could again pipe this into a count on location.
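That logical-column pipeline, sketched with the same assumed owid and fully_vax names as above:

```r
library(tidyverse)

# fully_vax > 50 evaluates to TRUE or FALSE, so above_half is a logical
# column -- and a logical column is itself a valid argument to filter
owid %>%
  mutate(above_half = fully_vax > 50) %>%
  filter(above_half) %>%
  count(location)
```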
And we see that there are 97 countries that have at least one time point where they are above 50% in their vaccination efforts. Good job, those countries. Again, what I'm showing you is that you can use a logical column to filter your data frame. We could of course combine this with other filter conditions: we could say above_half & date == "2021-10-28", and then we see there are 40 countries that are above 50% and have data from October 28; there are no countries that are above 50% but, for whatever reason, don't have data for October 28. Again, this filter function is really powerful and allows you to make all sorts of queries of your data to produce new data frames where we've filtered down to the rows we want. So it's important to keep this straight: select selects columns, filter filters rows. They can be used together to get a data frame that suits us and serves us well for future analyses, data visualizations, or whatever we want to do. There are also special functions that act a lot like filter, that are perhaps built with the same kind of machinery as filter, but are simpler to write. Again, if we look at owid, we notice that the first 10 rows are from Afghanistan, and there was no data from Afghanistan in February and early March of 2020. I'm not really interested in those rows with NA values, so how would we get rid of them? Well, again, we could use filter with a function called is.na, applied to fully_vax. That gives us all the rows where we have an NA value. If we want the rows where it's not true that fully_vax is NA, we can use the exclamation point: the exclamation point makes true false and false true. Very cool, huh? So now when we look at this, we get all the data, the 26,009 rows from our data set where we have data on fully_vax.
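The NA-handling filters look like this, using the same assumed owid and fully_vax names:

```r
library(tidyverse)

# Rows where fully_vax is missing
owid %>%
  filter(is.na(fully_vax))

# ! turns TRUE into FALSE and vice versa: rows where fully_vax is present
owid %>%
  filter(!is.na(fully_vax))
```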
So there are no NAs in there for fully_vax. This is such a commonly used statement that there's an easier way to do it: we can take owid and pipe it to a function called drop_na. If we don't give drop_na any arguments, it removes any row with an NA value in any column. But if I want to drop only the rows that are NA for fully_vax, I can give it drop_na(fully_vax), and I get the exact same data frame I had before. Again, you don't need to know drop_na if you know filter, but at the same time, that filter statement is a bit more cumbersome than drop_na, and drop_na is a bit more expressive about what's happening in your operation. Now let's clean up this whole mess to get what we actually want. I'm going to use both the filter, to get the locations United States and Canada and the date, as well as that drop_na, and I'll add them to the rest of my pipeline here. And then if I look at owid, I see I've got those two countries. Again, we are slowly working our way through a pipeline of taking data from Our World in Data and getting it into a format that's easy to work with in R, that's going to make it really easy to automate, so we can update the data when new data is released, and that ultimately is very reproducible and easy, easy-ish, for us to maintain and then link in with that Ipsos data. So that you don't miss the next exciting episode, where I talk more about these steps in my dplyr pipeline, please be sure that you've subscribed below and that you click that bell icon so that you're notified when the next episode is released. I would love to have you come back and see how we're working our way through this. If you have any questions, feel free to holler; I'm happy to answer them in a future episode. Keep practicing with this, and we'll see you next time for another episode of Code Club.