 In August and October of 2020, Ipsos surveyed people in 15 different countries asking them whether or not they would receive the COVID-19 vaccine once it became available. Well, of course, several months later, the vaccine became available. I received it back in January and February, I believe. Hopefully you've received it as well. And so what we can now do is we can take the measures of people's intentions to receive the vaccine back in October of 2020, and we can compare it to whether or not people in those countries received the vaccine. We've gotten the updated data from those 15 different countries, as well as all the other countries that report their data, from a website called Our World in Data. I've been abbreviating it as OID. And in the recent episodes, we've been taking the Our World in Data data and cleaning it up to get it to a position where we're now ready to join it, to merge it with the data from Ipsos. Along the way, some of the things we've had to do is to filter the data so that we get the most recent data from each country, where we know the percentage of people that are partially vaccinated or fully vaccinated. We now, like I said, are ready to join together these different data frames. And the ability to join different data frames is one of just the coolest things I find about Deplier that makes things so easy. Before I found out about Deplier and was using R and using Baysar, this is one of those operations that was just so error prone and just, I don't know, had me in knots and was just a real pain. But using these join functions from Deplier is a real game changer and really makes things so much easier. And that's what I'm going to show you how to do in today's episode of Code Club. So I have my code where we get the data from Our World in Data and we clean it up. I'm going to go ahead and load all that. I also have our old code from building out the various plots we made with Ipsos's data. I'm going to go ahead and copy that code where we read in the CSV. That CSV is actually part of the repository. So if you want to get a hold of the code that I'm using, you can go down below. There's a link to a blog post as there are for every one of these episodes that will show you how you can get the code and get the data that we're working with here. I'm going to go ahead and copy this and bring it back over to Comparison Figure. And I will put this at the end. And I'm going to call this Ipsos. So we've got OID and we've got Ipsos. And I need to get rid of this pipe here at the end. So what this pipeline is doing that we're now calling the output of Ipsos is it reads in the data from a CSV and it renames those columns. We now have Ipsos, which again are those 15 countries with their percentage from August and October and those 15 different countries. We actually also have a row in here for the total. I believe we figured out that that was actually the average across the 15 different countries. And then if we look at OID, we have those same 15 countries with the location, the percent of people that received at least one jab of the vaccine and the percentage of people that are fully vaccinated. I know that the China here is an NA. I think I'm going to do my analysis with the percentage of people that are fully vaccinated rather than all vaccinated to get that data point. We'd have to go back more than a month. And it was a pretty high number already, you know, just a little bit higher than the fully vaxed. So they're in pretty good shape. So I don't think it really matters that much. Now what we need to do is merge the OID and the Ipsos data frame. And there's a variety of ways that we can do that. The first thing to notice is what column would we join the data frames using? In the Ipsos data frame, we call the column country. And in the OID data frame, we call it location. Well, we could rename location to be country, right? And so that we could do up here. And I'm going to hold off on doing that for now because I want to show you what can you do if you have two data frames with different columns that you're trying to join on. So I'm going to come down and do a little bit of work here in the terminal with you. So we'll start with a function called full join. So we'll do full join on Ipsos and OID. And we will then join by. And so we use the by argument with a C vector. And here we're going to say what two columns do we want to join on? So in Ipsos, it was country. And in OID, it's location. So we join that. And now we get a complete data frame of all 15 countries plus the total. And we notice that although we don't have the all vax and fully vax data for total, it's putting in an NA value for us when we do this full join. So again, the full join looks at the column that you want to join on. And it returns a data frame with all values that are found in those columns that you're joining on. If that doesn't make sense, then let me show you what happens when we do an inner join. So if we do inner join on Ipsos and OID, and then we also do the by equals C with the country equals location, we now see that we only have 15 rows back, right? And so the inner join requires that for the data frames to be joined together, that you only return the rows in the new data frame where that column has shared values, right? So OID didn't have a total value in the country column or in the location column. And so that didn't come in the output, okay? So inner join is what I often use because it basically performs a filter, right? So we use the country labels from Ipsos and filter the data from OID. Another type of join that we might think about doing is a left join. So again, we could take the same syntax, but instead of inner join, we could do left. So what left join does is it uses the names in the column that you're joining on. So in this case, country, that's in the left hand data frame. So Ipsos, right? So the Ipsos data frame had a value in country of total. So our output has total, right? If we were to do right join instead of left join, you can almost predict what happened, right? Well, now we use the data in the location column from the data frame on the right. So OID. So OID didn't have a total value in location, therefore it's left out. So for this data, right join and inner join look the same. And left join and full join look the same as well. But that won't always be the case. So another type of join we can do is what's called an anti-join. And the syntax is again the same as all these other function calls. So we'll do anti-join. And what anti-join does is this will tell us what is an Ipsos that's not in OID. So basically what from Ipsos would not turn up in a output of an inner join. So we should get total as the output data frame. And sure enough, we get a data frame that's outputted. That's basically the total row from the Ipsos data frame. Now, what if you wanted to know what's in OID but not in Ipsos? Well, we're going to flip it, right? And so we'll do anti-join, OID, Ipsos, and then buy. And now we need to flip location and countries. So we'll say location equals country. And now we see we get a table with no rows as the output, because everything that was in OID is also in Ipsos. So these are the five different joins that you frequently will use with the plier. So a full join, you can think of as returning all of the data. Inner join will only return the data that is shared between the two data frames. Left join will return the data frame that's dictated by the data frame on the left, whereas right join is dictated by the data on the right. An anti-join will tell you what rows are left out of the output when you would do something like an inner join. Now, I mentioned that you can think of these joins also as a filter. Right? Now, what would happen if I go ahead and remove this filter function? Right? So, well, if we run this, we will now get OID to represent all of the countries in the database with those values of partially vaccinated and fully vaccinated for those countries with the most recent data. And so here we go. We have 226 countries worth of data here. And so what we would like to do is we'd like to join OID with Ipsos. Again, you can think, well, if I do a left join with OID and Ipsos, again, buy, and then we can then say the column location equals country. We now get all of the rows that are from the left hand data frame, right? So the OID in this case. And anywhere there's missing data, it plugs in the NA. So that's not totally what we want, right? Alternatively, we could do right join. And so in the right join case, we then see that we get all of the rows that are found in the Ipsos data, right? And so in this case, total is not found in the OID data. So we get NA values for that total row. So this isn't exactly what we want either. And I can already tell you we don't want full join because full join is like everything, right? We want an inner join. So we can then do inner join. And then we get back 15 rows, all 15 countries. We don't get that total line. And we get the percent August, percent October, the all Vax, fully Vax from those two data frames joined together. So this is really an example of how you can use inner join to filter one data set using another. We're using the Ipsos data to filter the OID data. So I'll go ahead and put this as part of our code. I'll comment it out for now so we don't accidentally run it. And I'll go ahead and delete this big long filter statement that we ended up not really needing. And one thing that I commented on is that an alternative to using that by statement was to actually rename the location column in OID to be country. So let me show you how we can do that and how that would change our inner join syntax. So again, with select, we can rename columns using an equal sign. So I could say country equals location. And then I'll go ahead and put date on its own row. And then I'll group by country. And now if I look at OID, I see that I now have country as the column name. So now if I want to join these two data frames together, I can do inner join, Ipsos, OID. And actually that's all I need. So I can run this and the output that you'll see is that table with 15 rows and five columns. And it tells us that it's joining by country. Now my preference is always to try to be explicit in what I'm telling R to do. So what I would do is to add the by argument. So I would say by equals and then in quotes, country. And again, this gives you the same output without that little message that we got before at the top of what it's joining by. And so now we've got our 15 rows and our five columns and we're off to the races. We're ready to go in the next episode to take this data and plot it. So I'll go ahead and save the output of this inner join to be Ipsos, OID. And I can change my buy to be by country because we changed the location column to be a country column in the OID data. And so like I said, now we're ready to go and plot the data in the next episode. Again, I hope you found this discussion useful of how we can use different joins to merge various data frames and how we can use an inner join to filter one data frame using the values in another data frame. Also, I appreciate that kind of like this example, we get data out from the wild and different places call things different things, right? So OID called those countries location. They also had continents in there besides the locations, which is okay. And whereas the Ipsos had things called country, even though they weren't all countries, they had total in there. Anyway, those joins are flexible enough that they allow us to join two data frames on columns that have different names but represent the same type of information. Encourage you to play around with this with your own data. This kind of gets us into the realm of thinking about our data as being relational. And it's just like I said, a game changer for me in terms of being able to combine different types of data, where perhaps I have a set of metadata describing samples and then I have other measurements of those samples and I can then use inner joins or whatever to join those together. And it's just written really nice. And we can see in this case how nice it was to be able to take data from the OID website and join it together with the data from Ipsos. And then great thing about this is that it's automated, it's reproducible, that I can come back tomorrow in a month from now and rerun it and get the same output or I guess updated output without having to change anything. I have nothing hard coded in here in terms of dates, in terms of countries. That's all defined by the data that I have. And that's, I think, really cool and really is a testament to the value and power of these deployer commands. So make sure that you're subscribed and that you've clicked that bell icon so you know when that video comes. And we'll see you next time for another episode of CodeClub.