If you've been following along in recent episodes of Code Club, you know that I've been looking at different ways of visualizing data that was published by Ipsos about a year or so ago, looking at people's willingness to receive the COVID-19 vaccine. We had survey data from August and October of 2020 asking people whether or not they'd be willing to receive the COVID-19 vaccine when it became available. Ipsos, in these two surveys at least, looked at 15 different countries. So here we are a year later. I've been vaccinated. Have you? I hope so. Well, the question now is: we have the data from, say, October of 2020 telling us who would be willing to get vaccinated, and here we are a year later, in October of 2021, moving into November of 2021, wondering how many people actually got vaccinated in all these countries. So that's what I want to do. I want to build a figure where on the x axis we put people's willingness to receive the COVID-19 vaccine from back in 2020, and on the y axis we put the actual percentage of people who have been vaccinated in those 15 different countries. We're not going to finish that in today's episode, because there are a few steps along the way that will let us explore more of the nitty-gritty details of working in a tidyverse package called dplyr. We've been using it as we build out these different visuals, but I want to take some time to talk about some of the bigger functions and some of the cool things that I really value about dplyr. Yes, we could do it all in one episode, but then we'd lose this great opportunity to learn more about dplyr. So over the next few episodes I'm going to show you how we can use dplyr to take data from one source and combine it with another source to tell a news story.
To create the data on the y axis, I need to know how many people have received the vaccine in these different countries. So something I could do is come to Google and search for something like "how many people have been vaccinated in Australia". We've got the number here for Australia, for India, the United States, and Brazil, and we could repeat the search, hunting around for the rates in each of the 15 different countries in our survey. This gets painful. Why does this get painful? Well, you're doing a lot of manual work here, right? And, to foreshadow a bit, it might be a month before we finally make that figure, and over that month the numbers might change. So then we'd have to repeat all these searches to get updated numbers. Perhaps the numbers won't change a whole lot, but if it's something like the early days of the COVID pandemic, when the numbers were changing really rapidly, I wouldn't want to have to do a lot of searches to find the actual rates or counts in these different countries. I would like it to be much more automated, or at least to have all the data in one location. Another concern is the reliability of the data. Where are these data coming from? Do we trust them? This is being offered up by Google, so I'm reasonably confident that this is reasonable data. And there's a link here called "About this data" that opens a page from Google telling us something about the vaccine statistics and some caveats to be aware of. But we all know that data posted online might be old, might be a couple of months out of date, or it might be current. We don't know, right? That's a challenge when we're grabbing data off a website: we would like to trust Google on something like this, but at the same time, you never know.
Another challenge is that if not all of the countries are represented in a single search, we might be mixing and matching different methods. It would be nice to get all of our data from one location, because I don't want to get one website's vaccination percentage, or definition of what it means to be fully or partially vaccinated, mixed with another website's. From some background reading, I know that some places consider people who have had the virus to already be partially vaccinated, whereas other places don't count those people at all; you have to actually get a jab to be counted as at least partially vaccinated. And then what does it mean to be fully vaccinated, right? So again, we would like to get all of our data from one place, and this hodgepodge of Google searches to pull together our 15-country data set is probably not an ideal approach. One thing I notice about the output of the Google search that intrigues me is that this visual down here for Australia, showing the total number of people vaccinated over time, says it came from Our World in Data. Now, Our World in Data is a website that I've learned about mainly during the COVID pandemic as a place to get really large international data sets about interesting questions, including COVID and COVID vaccination rates. I know that Our World in Data is a reliable source of information about COVID based on other people I trust in the community and where they get their data: I hear Our World in Data mentioned a lot, and in a positive way. This is reliable data that I'm confident using for my analysis. So as I scan down through here, I see a really nice dashboard that's going to give me a lot of useful information. From this dashboard, I see that 63% of Australians are fully vaccinated against the virus.
Another 11% are partially vaccinated, so a total of 74% are at least partially vaccinated against COVID-19. Again, I could scroll down here and see how they define things like partially or fully vaccinated. So that's great. I think this will be a great resource for getting the percentage of people by country who are currently vaccinated against COVID, and we also see the date here: October 28, 2021. I could then look at some of the other countries I have, and actually I don't have to search around; I can type it in here. If I type United States, we now see the United States relative to Australia: 57% of people in the US are fully vaccinated and 9% are partially vaccinated, right? And so we could keep doing this for all 15 countries in our data set. So this is improving things, right? We have a single source that we trust for getting the data, we have consistent definitions, it's reliable, and we kind of know the provenance of the data. All of the countries from our survey are here, so we could get all 15 and map those over. But at the same time, it does get painful to update. If I want to update this in a month, what am I going to have to do? Well, I'm going to have to go back through here and do the searches again. So while I might not be doing 15 Google searches, I do have to do 15 different searches for the countries here in their COVID-19 Data Explorer. There are certainly research questions where this is the approach we have to take. We're just kind of stuck, right? There isn't a way to get data directly out of the database behind the website, and so you might just be stuck with manually pulling numbers off a web page. What would you do with those numbers? Well, one idea would be to bring them into RStudio.
We've already got this August–October 2020 CSV file, so let's go ahead and view it. Again, this is our comma-separated values file, and right off the bat I'm reminded that I downloaded this file from the Ipsos website, because we have these kind of funky column names; one of the first steps we did when reading it in was to change the column names, right? So I could add a fourth column, say "actual October 2020", and then fill in that column for all of these countries. That's not so ideal. One of the practices that I really try to preach to people, especially in my lab, is that we want the raw data to stay raw. If I download this file from Ipsos, I'm getting the raw data, and I want to leave it that way, those ugly column headings and everything. If I then have other data, say data I'm pulling manually off the Our World in Data website, I probably want to create a separate CSV file. Okay, so we could create a text file, and I'll save it, perhaps as owid.csv, for Our World in Data. Then we could put in a header line with "country" and "rate", and then a line for Australia. What was the rate here? Let's go ahead and use the "at least partially vaccinated" number, so 74. And then United States, 66. You can already see, as I'm typing, how error prone I am, right? I'm introducing typos that I'm thankfully catching, and I don't really trust my ability to get that 66 entered correctly. But regardless, we could keep doing this for all 15 countries, maybe double or triple checking things.
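A minimal sketch of what that hand-entered file could look like if we build it in R instead of a text editor; the file name owid.csv and the two rates are just the illustrative values mentioned above:

```r
library(tibble)
library(readr)

# Hand-entered rates pulled off the OWID dashboard (illustrative values);
# tribble() lets us type the table row by row, a bit like a gridded spreadsheet
vaccination_rates <- tribble(
  ~country,        ~rate,
  "Australia",        74,
  "United States",    66
)

# Write it out as owid.csv (a hypothetical file name) and read it back in
write_csv(vaccination_rates, "owid.csv")
read_csv("owid.csv")
```

Typing the numbers through tribble() still carries the same transcription risk, but at least the values can't slip out of alignment with their column headers the way they can in a plain text file.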
Also, here we're only looking at two columns, the country name and the rate, which is relatively simple. As we increase the number of columns, one of the challenges becomes keeping each value lined up under its column. As the country names get longer, that 66 is no longer under "rate" the way the 74 is under "rate" for Australia. That becomes challenging. So one thing you might think about doing is firing up Excel and doing it there. Again, in Excel I could enter "country" and "rate", and because I've got these nice gridded cells, I could easily type in Australia and United States and then put in the rates, which again were 74 and 66, right? Now everything's lined up and nice to work with. I could then use the read_excel function from the readxl package, which is installed with the tidyverse, to read this into my RStudio session. We've done this before a lot with clinical metadata, where perhaps we have sequence data on a bunch of patients and the clinical data comes to us from a collaborator who stores everything in an Excel workbook. I can easily read in data from the Excel workbook, and invariably there are some funky things about what people have typed in that we then have to clean up. But this can be a good strategy if we can't get the data directly out of the website, like we were able to with that original Ipsos data. I want to go back to the Our World in Data website, though, because I noticed a tab down here at the bottom that says "Download". That's always something that alerts me: oh, good things are here, because we might actually get the real data, kind of like we did from that Ipsos website. And so, wow, there's a nice shiny blue button here saying "Download a CSV containing all the data used in this visualization". So that gives me a little bit of pause.
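If the table lives in an Excel workbook instead, the read-in step might look like this sketch; the file name vaccination_rates.xlsx is hypothetical:

```r
# readxl is installed with the tidyverse but not attached by
# library(tidyverse), so we load it explicitly
library(readxl)

# read_excel() reads the first sheet of the workbook into a tibble by default
rates <- read_excel("vaccination_rates.xlsx")
rates
```

From here the tibble behaves exactly like one produced by read_csv, so the rest of the pipeline doesn't care which format the data arrived in.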
Because I don't want just the data from these two countries; I want all the data. So what I'm going to do is download it and see how big it is, whether it's got countries other than Australia and the United States, and perhaps more time points, because perhaps it's all the data, not just the data in the visualization. I went ahead and put the downloaded file into my vaccination attitudes directory, and I see it's 33 megabytes, so it's clearly more than just the data for those two countries. I'm going to go ahead and try to open and view it here in RStudio. Unfortunately, it was too big to open in RStudio, so I opened it in Excel instead. You can see right off the bat that there's data here for Afghanistan, so there's a lot more data than just Australia and the United States, and if we scroll down, we see there's data for Zimbabwe. We've got a total of about 127,000 rows in the data frame, and I don't know how many columns there are; there's a lot. The columns run through BM, so they went through A through Z, then AA through AZ, and then BA through BM. That's probably something like 70 different variables. There's a lot of different data in here, far more than was present in that visualization. So this is awesome. We can now read this into RStudio and use it to help filter and select down the data we want to join with our Ipsos data. I'll go ahead and create a new R script, and I'm going to call it comparison_figure.R. As we always do, I'll load the tidyverse package with the library function to make sure we've got all those goodies loaded. With the tidyverse comes a package called readr, and readr has reading functions that work really nicely, and better than the read functions that come in base R.
And so what we can do is use read_csv, because we have a comma-separated values file; if you look at the last three characters in the file name we downloaded, they were csv. Then in the parentheses we put quotes and start typing owid-covid-data. One of the nice things about working in RStudio is that I can start typing the name of the file and hit Tab, and it will autocomplete it for me. I'm going to assign the result to a variable called owid. It's a big file; it's got, again, about 127,000 rows and 65 columns, but we've got it read in. I can always type owid down here in the console to see the first 10 rows, or I can do View(owid), with a capital V, which opens up a spreadsheet view of the data frame. It's read only; you can't edit it. You can also get here by coming over to your Environment pane and double clicking on owid, which again launches the spreadsheet view of the data. I just want to see whether view(owid), with a lowercase v, works as well. Great, it does; I'm going to go ahead and close this. And I'm going to delete my earlier owid.csv file, because I don't need it; we're going to get everything directly from the OWID data. We are now ready to march ahead with this data analysis, and that's really cool. One thing I'm thinking about, though, is that this is a really big file, right? It's 33.3 megabytes. I can probably store it in GitHub, but it's big. The other thing is that I think they're putting out new data every week, or maybe even every day, so this file is getting updated regularly. And so while it's nice that all the data is in one file, and I can go back to that website and click the button to redownload it, I might lose track of my different versions of the file.
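Pulling the steps so far together, the start of comparison_figure.R might look something like this sketch, assuming the downloaded file sits in the working directory:

```r
library(tidyverse)

# Read the downloaded OWID file; read_csv() returns a tibble and prints
# its column specification as it parses the file
owid <- read_csv("owid-covid-data.csv")

dim(owid)   # roughly 127,000 rows by 65 columns at the time of recording
owid        # typing the name prints the first 10 rows in the console
View(owid)  # opens RStudio's read-only spreadsheet viewer
```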
And, you know, it's just perhaps not as convenient as getting the data directly from the website. You might be thinking: what are you talking about? How do you do that? Well, one of the really cool things about the read functions in R is that, yes, you can give them the name of a file on my computer, but you can also give them a URL, the address on the web, for that file, which is really cool. This is something we weren't able to do with the Ipsos data, because that data was hidden behind a bunch of messy JavaScript, but I think we can do it here. So, coming back to the OWID dashboard, if I put my cursor over this beautiful blue download button and look in the bottom left corner, I see that there's an actual URL there. And if I've got an actual URL, I can use that as the argument to my read_csv function and skip manually downloading the file. I'm going to do that: I'll copy the link address and then replace the file name that's on my hard drive with the URL. And now if I run this, we see that it downloads the data and reads it in, magically, right? I can go ahead now and delete this owid-covid-data.csv file; I don't need that. Yes, I want to get rid of it. Again, I can load the owid data frame without having to save the data onto my hard drive. So we're still downloading it, as you saw in the output from read_csv, but the file does not exist on my computer. And that's kind of cool, because it means I could run this script every day, every week, or every month, and see how my figure changes over time, which is pretty cool, right? One of the downsides, of course, is that we are dependent on the folks at OWID. If Our World in Data goes belly up, or if there's some network outage, then we don't have access to the data, right?
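A sketch of the same read, pointed at the web instead of the hard drive. The URL below is the address the OWID download button linked to around the time of this episode; check the dashboard for the current link:

```r
library(tidyverse)

# read_csv() accepts a URL in place of a local path: the file is
# downloaded and parsed in one step, but never saved to disk
owid <- read_csv("https://covid.ourworldindata.org/data/owid-covid-data.csv")
```

Rerunning the script later pulls down the latest version automatically, which is what lets the figure stay current without any manual bookkeeping of downloaded files.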
So there's a bit of a trade-off between having a physical copy of the data on your computer and the flexibility of being able to run this without having to worry about downloading an updated copy. I can run this, and as long as the website is up and they've been paying their bills, I should be able to get the data into my R session. There's another trade-off here besides the expectation that the data will always be available, and that is the speed of your internet connection, right? It took a couple of seconds for read_csv to run, because it was going out, getting the data, and reading it in, versus downloading the data once and then reading it in locally. Reading from the physical file still isn't instant, because it's pretty big, but, you know, there are trade-offs, right? I like the URL approach for the reasons I've mentioned. You might have a slower internet connection and say, you know, I don't want to run this command over and over again; I'm going to download the file once, read it in with read_csv, and if I need an updated version in a month, that's fine, I'll download it again and roll with it. Make sure that you're subscribed to the channel so that you know when the next episode comes out. I do want to spend a few episodes here talking more about dplyr, and I think you'll get a lot out of it. And I'm really excited to see what our final figure looks like when we compare what people said they would do in 2020 with what they actually did in 2021. Keep practicing with these concepts, and we'll see you next time for another episode of Code Club.