Zoom recording is on. There we go. Great. Thank you, Drew. Drew is my colleague in the Center for Data and Visualization Sciences, and he's actually a specialist in GIS and spatial analysis, so he's a handy person to know.

Okay, so I'm going to expose you to how to do these queries and how to do the visualizations, and hopefully point out some useful documentation along the way, because quite honestly there is a ton of documentation. And if you're anything like me, while you recognize the value of reading the documentation, sometimes you just want to get started and see what you can do. So I'll try to point out where that documentation is so you can reference it later.

In terms of background, one of the things I said in the registration notes is that we're building on earlier knowledge of R and the tidyverse, from workshops I've taught in the past. If you don't know about it, there is a website called Rfun, a sub-branded site for my department, where I'm collating a whole bunch of R workshops I've done before. We're basically building on the first two sections of "Quickstart with R."

In particular, just to do a little review, let's take a look at the dplyr cheat sheet. You can get this cheat sheet from within RStudio, and you can also get it from this location right here; if you free-text search the word "dplyr" there, you'll find it. What dplyr allows you to do, basically, is transform data. The verbs we're really concerned with today are filter, select, mutate, and arrange.

Let me expand my screen so this is a little easier to see. I won't belabor this too much, but if you have a grid-based rectangular data frame and just want to limit the rows (the cases or the observations, depending on what you call them), you do that with filter. In this visual representation, you're limiting a four-row data set down to one row, with the dark gray being the header. Similarly, we can subset by variable or column; that's the select function, and in that case it's narrowing a three-variable data frame down to a single column. I think most of you know what I'm talking about, so I just want to review: this is the ability we're going to leverage. We'll have R, or really the tidyverse, specifically dplyr, take what we already know and translate it into a different query language.

Two more functions are important. One is mutate, where we create new variables: if we have a three-variable grid, say a, b, and c, and we want to add them together, we can create a new variable called d that is the sum of a, b, and c, or whatever math we want to do. And lastly arrange, which sorts the rows by a variable in one of the columns. I think everybody already knows this; that's the knowledge we want to leverage. So let's see how that gets done, and let me go back to my slideshow. Sorry, this is where I want to start: we were just talking about grid-based data frames, right?
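Just to make that review concrete before moving on, here's a minimal sketch of those four verbs chained together on mtcars, the data set we're about to look at:

```r
# A quick refresher: the four dplyr verbs from the cheat sheet,
# chained on the mtcars data frame.
library(dplyr)

mtcars %>%
  filter(cyl == 6) %>%                # limit rows (observations)
  select(mpg, cyl, wt) %>%            # limit columns (variables)
  mutate(mpg_per_ton = mpg / wt) %>%  # create a new variable
  arrange(desc(mpg))                  # sort rows by a variable
```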
Technically, this is what we're going to call a flat file, and folks who use R are going to be very familiar with flat files. This particular glimpse of a flat file, just the first eight rows, is from a data set called mtcars. I don't know that it comes on board with R itself, but you can easily get access to it by installing and loading the datasets library. It's familiar grid data, it's not huge, and we can apply all of those dplyr verbs we've learned to transform the data so that we can eventually visualize it.

In contrast to a flat file is a relational database. Now, there are lots of different kinds of databases, but relational databases have been around easily for 40 years. They're very robust, very mature technology, and they were designed for a couple of purposes, one of which is to handle voluminous data. The difference between a flat file, which you can think of as a single table, and a relational set of tables is that the tables all have keys that link one table to the next.

This whole relational database is information about sales of some kind of product for some kind of company. This one small table here, which is very efficient, really consists only of IDs, keys to other tables, plus the number of units sold. So if I have one row that says I sold four units on February 23rd, I can use this key to look up February 23rd in the date table and get more information about that datetime. I can similarly look up the key to get more information about the store that sold the product, and use the product ID to get more information about the product itself. So a relational database is a very compact, efficient way to store volumes of data, and it's also a very efficient way to do a lot of computational subsetting and analysis; that just happens not to be what we do in R. There's a whole profession of database engineers and database administrators who know that stuff very well, but for my purposes I'm going to pose as a person who doesn't know too much about SQL, and that's exactly what happens here.

Just in terms of a comparison, and this is a very generalized comparison where every row has footnotes or exceptions: a flat file is usually good for a single user, whereas a relational database is going to be good for multiple users. Relational databases are usually in the cloud, whereas a flat file is likely going to live on a single disk. You can do simple queries with a flat file; you can do very advanced queries with a relational database. Flat files are usually static. They don't have to be, but most of the data I'm analyzing in R changes only occasionally. Relational databases, on the other hand (and I don't know whether they were actually developed *for* the banking and finance industries, but those industries quickly adopted them), suit very dynamic data. Banks had issues where there could be collisions, with multiple people putting in the same information at once, and most relational databases are designed to handle that kind of dynamism. There are different kinds of relational databases with different qualities, but let's stick with the general picture.
Flat files are also only good up to a certain size. Once you get over, is it two or three gigabytes? I'm forgetting the exact figure, but at a certain size R starts to break down; you just can't load that much into memory. A relational database is really designed for those volumes of data, and of course the structure can be much more complex, like I was showing there.

These complexities and nuances can be a good thing, but I want to caution you: I am not trying to sell you on moving all of your data into a relational database. That's not my purpose here. There are many good reasons to keep your data in a flat file, not the least of which is the amount of effort you have to put into administering a database, the backups and things like that. There are also several good reasons to pursue more complex data structures and deliveries; you should just be aware that as you pursue these more complex approaches to managing your data, you take on more expense. There's more computational expense, more financial expense, more administrative expense, and when we start talking about really large amounts of data, large organizations add people like database engineers and database administrators to ensure data integrity.

To underscore why people focus on that data integrity: companies like Twitter and Google and Amazon have made it really clear that data is becoming the new oil, as it were, meaning it's very valuable. So if you're going to put all of your data into one system, you need to make sure it doesn't get corrupted and that you can continue to get the data out. Again, I'm not telling you that you should put your data in a database, but you may run across situations where you have so much data that you want to, and I want you to be aware that, among other things, R can access that database. Also, it's just cool that we get to use Google BigQuery as our example.

Now, earlier I did a really quick visual overview of the dplyr cheat sheet, and here's a basic dplyr query right here. If we piped that query, where we're just filtering and selecting a few variables, to a function we haven't discussed before called show_query(), it would deliver for us a SQL statement, a Structured Query Language statement; that's what you see over here. These two statements are basically equivalent. Now, there are different dialects of SQL. It's an industry standard, so there are slight differences between Oracle's version of SQL and Microsoft SQL Server's version, but generally speaking the basic stuff works the same. And that's what dplyr is able to do for you by being on the front end: it can recognize that this object right here is actually a database, broker the request through dbplyr, send a statement that looks more like this to the relational database management system in the background, and then we get some data back.
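Here's a runnable sketch of that translation step using dbplyr's tbl_lazy(), which fakes a database-backed table so you can see the generated SQL without connecting to anything; the exact SQL printed varies a little by backend:

```r
library(dplyr)
library(dbplyr)

# tbl_lazy() wraps a local data frame in a pretend database table,
# so show_query() can print the SQL dbplyr would generate.
mtcars %>%
  tbl_lazy() %>%
  filter(cyl == 6) %>%
  select(mpg, cyl, wt) %>%
  show_query()
#> Prints something like:
#> SELECT `mpg`, `cyl`, `wt`
#> FROM `df`
#> WHERE (`cyl` = 6.0)
```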
I also included this left join statement, not that it's directly related, but if anybody has worked with relational databases, you know that because you have all these connected tables, you often have to do a lot of joins and merges to get some larger, flat-file-like structure you can analyze. In a tidyverse context we don't necessarily do a whole lot of joins; depending on your data, you might. And while I might be able to pull the dplyr syntax out of my head because I've worked with databases in the past, the SQL syntax for the join I'm always going to have to look up, because I don't do it that often. So this is one more advantage of the dplyr approach: I don't have to look up the SQL syntax, I can just use the dplyr syntax I already know. (There's a sketch of the dplyr side at the end of this passage.)

So, some questions I'd like to address, like: why would you do this? One, it's especially convenient if you already know dplyr. Especially if you're only querying to bring in more disparate sources of information, you don't have to go out and learn SQL; you can stick with dplyr, and it's probably going to work really well until you get into really advanced SQL queries. I'm not going to demonstrate the points where it breaks down, because I myself generally don't do those super-advanced queries.

Another reason is efficiency. I think I talked about this, but at a certain point you just can't read that much data into your RAM using R, and the general idea in a big-data context (I'm repeating myself) is that you want to do your transformations and data processing where the data are: you're moving your compute to the data. That's exactly what Google BigQuery allows us to do. It gives us control over the compute where the data exist, which happens to be in the cloud, in somebody's server room, just not ours. The other half of the technique, of course, is to visualize, in our case using ggplot, which we already know, rather than having to learn some whole other visualization approach.

The last bit is simply access: there are a lot of databases available to you, and it's good to bring in those disparate sources anytime you can, especially if you don't have to maintain that data yourself, provided you have some reasonable faith that the data is worth looking into. And it's not hard to get started.

Before I go too much further, I see Jonathan Holman asked a question: "In theory, can a relational database always be transformed into a big flat table?" I'm going to say: in theory, yes. You don't always need to transform it into one big flat table, however; you just need to transform it so that you get the data you want to view. In R, the simplest, most convenient metaphor is that flat table, although R has other kinds of data structures, matrices and lists to name just two. You don't have to flatten everything, but the flat table is a convenient way to think about your data.

All the code we're going to look at, I sent out earlier. If you don't have it, just pop me a note in the chat box and I'll send you the link again. What we're going to do next is look at the BigQuery sandbox.
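To ground that join point, here's roughly what the dplyr side looks like for the sales database sketched earlier; the tables and keys below are toy stand-ins invented for illustration:

```r
library(dplyr)

# Toy stand-ins for the relational sales tables in the slide.
sales    <- tibble(date_id = 1, store_id = 1, product_id = 1, units = 4)
dates    <- tibble(date_id = 1, date = as.Date("2021-02-23"))
stores   <- tibble(store_id = 1, city = "Durham")
products <- tibble(product_id = 1, product = "widget")

# One left_join per key walks the keys out to a flat, analyzable table.
sales %>%
  left_join(dates,    by = "date_id") %>%
  left_join(stores,   by = "store_id") %>%
  left_join(products, by = "product_id")
```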
If you haven't already, you want to install the tidyverse, the DBI package, and bigrquery; I sent this out in the registration notes. You don't actually have to install dbplyr, because it's installed as part of the tidyverse, but I left it in the list to emphasize that we're going to be using dbplyr indirectly.

So let's take a look at Google BigQuery, just so we know what we're getting into. There are some links in the README file, down at the bottom, to different views of Google BigQuery. There's a free tier for the Google Cloud Platform, and that, I'm pretty certain, is how I signed up to use it. Now, I want you to know this is the first time I've taught this particular workshop, and I was reluctant to uninstall all of my connections to Google Cloud Platform just to verify every step, because Murphy's Law has a tendency to rear its ugly head right when you're about to start a class, and I didn't want to undo what I'd already done. So I can't tell you for certain whether I'm connected under the free tier, where you have to provide a credit card even though you're not getting charged, or whether I'm using the BigQuery sandbox. The BigQuery sandbox is supposedly for people who don't have a credit card to provide. I'm going to link to that, and what you'll see is a setup page that explains how the sandbox works, tells you everything you need to do, and even gives you a link to the Google Cloud Console, which we're going to connect to in a second.

So those are the two ways you can connect to the BigQuery public datasets, and neither one should cost you any money. But I'm not a Google product manager or a Google rep; if it does cost you money, I'd apologize, but I have no way of knowing how it would. In my experience, and I've been using this for well over six months, I've not been charged a dime. I've only been querying the public datasets, and my understanding is you can query up to a terabyte of data a month without being concerned about charges, and I'm definitely not going to do that today. So it should be safe, but like I said, I don't know the Google Cloud Platform pricing well enough to explain exactly how it works, so it's worthwhile to be cautious.

Anyway, we're going to query some of these public datasets. I'm going to right-click and open that tab, and here's a fuller explanation of the public datasets. If you click on this link, it takes you to what's called the Cloud Marketplace, which I only want to show you so you've seen it. I personally find it not super helpful; it's kind of like an app store, but just for free databases. I don't think they're charging for any of these. I have noticed some disparities between what shows up there and what we're going to see once I get into my console, so I want to point that out, but I don't want to rely on it, because really what I want to do is get to my Google BigQuery console.
So, this link right here, which I'm going to put into the chat, should go straight to what you're about to see, if you've already set up Google BigQuery. But I'm going to take one more half step and get there indirectly, going to the BigQuery page, which has a blue link that does the same thing: it goes to the console. This page explains what the BigQuery tool is and how you access it, with a little video. My experience with BigQuery is that there's a wealth of documentation, and I've provided only the links I think are really necessary; there's so much documentation that you could spend a week reading without ever actually accomplishing anything. But what I want to do is go to the console.

So I'll click on that, and now we're looking at the Google Cloud Platform console. If I click on this little icon (I forget what you call it; let's just call it the hamburger icon), you can see that I've registered on the Google Cloud Platform and have at least two services available to me: the home dashboard and BigQuery, plus a whole bunch of other stuff that I've never used and may never use. The other thing I can see here is a reference to all of my projects. If you don't have a project, you might have to click New Project. Again, excuse me, as a first-time user I don't know whether you'll already have the bigquery-public-data project pinned to your workspace (I think you will), but even if it's pinned, you're going to want to create a project. You need a project, which may be used for billing; even though I've designated one for billing, I don't get billed, because I'm only using the public datasets. If you're in the sandbox, I don't know that you even need a project. I'm sorry, I can't help you with that level of Google Cloud Platform detail, but that's how you create a project if you need one. One of the projects I created was this one, called workshop-rfun-2021-spring. That's just for me; you don't have access to it, and you'd have to create your own if in fact you need one.

But let's put that aside for a moment. This is the bigquery-public-data project right here, and if I expand it, I can scroll through this really long list of public data. One thing worth pointing out: even when you get to the bottom of this really long list, there's even more. There are just a ton of large public datasets out there: information about San Francisco film locations, bike share programs, San Francisco trees, Stack Overflow, Wikipedia. It's mind-boggling how much you can actually find here, and it's fun to browse through. We're going to look at some of these, starting with the COVID data. Let's see, which one? This one right here, covid19_jhu_csse. That's one dataset, and if I expand it, you can see the relational tables that make up the dataset: one called confirmed_cases, one called deaths, one called recovered_cases, one called summary. We're going to look at the summary table, and if I click on it, I get a listing of the columns, the variables, that exist in that particular table, and that's what
we're going to query. Now, you can do your querying right here in the console; there's an editor right here, and if I scroll down, I can get a history of queries I've run in the past. But I'm not going to do that; I'm going to use dplyr to do my querying.

So now that you've got this orientation, the important parts for me are: that you have this project right here called bigquery-public-data; that you can navigate from it to datasets and ultimately to tables within a dataset; and that you probably have to have your own project to associate billing with, billing that you will probably never be charged for.

At this point I'm going to open my project from the repository. If you haven't done this before, you can download the zip or open the repository in whatever R way you feel comfortable with, stuff we've covered in the past. Here's the repository; if I download that zip file, I can expand it on my local system, and when I do, I eventually get to RStudio, which is where I'm going right now. Let me make sure I share my RStudio. You should now see a screen dominated by the project I opened, which I opened by clicking that file in my local file system. This is my file system, and this is my console.

Let me get rid of those drawings I made on the screen and go back to my mouse. I'm going to open this file here, called 01 case study google big query, and I'm going to try to make it bigger for you. Before I do, let me note, because I mentioned it before: here under Help, Cheat Sheets, is "Data transformation with dplyr." That's quick access to the cheat sheet I mentioned earlier. I'll alter the appearance a little so you can see better, make the screen larger, and go to a single pane. There's enough text in this R Markdown document that you can read through the things I've just discussed, like "if you don't have a credit card, go to the sandbox," et cetera.

Getting to line 27 of the code: I'm going to run this first code chunk (I have the warnings turned off), and it loads these three libraries. The critical ones for querying BigQuery are DBI and bigrquery; we'll again be using dplyr, which is part of the tidyverse. The very first thing we want to do is use the DBI library to establish a connection to this remote, cloud-based database, and this is where the stuff I just showed you in the console comes into play. The project is bigquery-public-data; let me share my other screen to make this clear. That's this right here: that's the project, that's the dataset, and that's the table, and this project of mine is what I'm going to use for billing. So going back, that's how I fill out the connection: my project, my dataset, my billing ID. When I run that, all I do is establish a connection to the remote database. It gives me a little bit of feedback there, and I'll use that connection object going forward. Now, at this point you may have to type this into your console (let me be clear about what I mean by the console): bigrquery::bq_auth(). I misspelled it on my first try. What that should do is broker a connection through your web browser to establish that you have the proper permissions, authentication and authorization, to query BigQuery remotely.
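Here's a sketch of that connection step. The function signature is bigrquery's; treat the billing value as a placeholder you must replace with your own project ID:

```r
library(DBI)
library(bigrquery)

# Establish a connection object to the remote, cloud-based database.
# project = the public-data project; billing = YOUR project ID.
con <- dbConnect(
  bigrquery::bigquery(),
  project = "bigquery-public-data",
  dataset = "covid19_jhu_csse",
  billing = "your-project-id"   # placeholder: replace with your own
)

# If you haven't authorized bigrquery yet, this brokers the
# browser-based authentication:
# bq_auth()
```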
I've done it so many times, though, that it isn't going to prompt me, so I'm just going to go on and use this next function, tbl(). This is where I connect to that summary table, part of the covid19_jhu_csse dataset; this is just one relational table of, I think, four. So first I made the connection, and then I connect to the table and make a pointer for it. I can call it anything I want, but I called it my_db_pointer; I probably should have called it my_table_pointer. You'll notice as I did that, I got a little waiting icon telling me I have to go further and confirm my connection through my Google account. For me, I just go into the console and type the number 1, choosing one option because I could have other options, and hit Enter, and that establishes the connection. Now, you can set that to run in the background so you don't have to keep typing the number 1 into your console, but I've been proceeding cautiously, trying to make sure I don't accidentally run up charges I don't need. Since I don't actually want to be charged for anything, I don't mind typing in that number 1. If you're doing something more production-oriented, of course, you would not want to do that; you'd want to establish what your connection barriers were and break them down.

All right, so all I've done there is create another pointer that connects to the database and the table. Now I can use a dplyr function called glimpse(), which for people familiar with base R is roughly the same as str(), but not identical, and I can use glimpse() to approximate this SQL query right here. If you look at the SQL, it's SELECT * FROM bigquery-public-data dot blah-blah-blah dot the summary table, and, importantly, it limits to only 10 results, because I don't want to query the whole database. It could have millions of rows in it; I just want to get a sense of what's in there. glimpse() comes back and tells me the stuff I wanted to know: there are 13 variables, what the data types are for those variables, and a little preview of the data in each column. It does not, however, tell me how many rows there are, because it doesn't know; it's trying to be economical about what it consumes at the remote database end. The glimpse() command is probably doing something very similar to LIMIT 10, and as a result it doesn't know how big the data frame is that it's querying. If I want to find that out, I can use another dplyr function called count(). When I run that, it hits the remote database and comes back with an answer, roughly equivalent to the SQL SELECT COUNT(*) FROM the summary table. And what is that number? It looks like a bit over a million, 1,289,000 rows or so, for this one summary table. Okay, so I've made that connection.
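In code, that whole sequence is short. A sketch, reusing the con object from above:

```r
library(dplyr)

# A lazy pointer to the remote summary table: no data moves yet.
my_db_pointer <- tbl(con, "summary")

glimpse(my_db_pointer)  # like SELECT * ... LIMIT 10: a cheap peek
count(my_db_pointer)    # like SELECT COUNT(*): hits the database
```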
Now, I don't want to belabor this, but I just want you to know that if you really want to do SQL queries, you can. This is the syntax for a SQL query. I would move this up here to make it a live, active code chunk, but I'm leaving it as is so that I don't run it. Using the dbGetQuery() function with the connection object we made earlier, we send a full SQL query and get some results back; or, if we want, we can write it out like this and assign the result to a different object name. All I'm showing here is different ways to send full-on SQL queries, which you might want to do if your dplyr queries aren't accomplishing what you want. (There's a sketch of this at the end of this passage.) But let's proceed with the assumption that we're just interested in querying, which is what dplyr is really good for. There are lots of other things you can do with databases, like inserting new records, renaming tables, and changing permissions; dplyr is not going to do all of that. It's excellent for querying, but if you need advanced operations like those, you'll probably end up writing out SQL commands in this kind of fashion.

Here's a different command that does something very similar to glimpse(). It's part of a different library, the bigrquery library, and you'll notice some similarity right here, in that we have the data source, the dataset, and then the table, in this case Austin bike share. If I run that command, it comes back and tells me what fields are in that bikeshare_trips table and what their data types are. It's just one more way of doing it, a little more SQL-like than the response you'd get from glimpse(), which is more tidyverse-like.

But moving on to dplyr: let's go ahead and start doing some dplyr searches. You've seen this already; we did it a moment ago. We use the tbl() function to make a connection to the summary table and give it a name. We've done it once and called it, what is it, my_db_pointer, I think. Let's just do this again, and you'll see up here in the Environment pane that these are all lists, an R data type called a list. They're not very complex and don't hold a lot of information, but they contain everything R needs to connect to the remote dataset. If I actually wanted the data, I would have to follow up by piping in the collect() function, and you can do that; just know that you might want to do it sparingly. It's hard to state an exact rule, but the rule I use in my head is: pull data over the wire as sparingly and as few times as possible, because every time I connect and gather data from the remote source, it's somehow expensive, computationally expensive and possibly financially expensive. If you have a database administrator managing data for you, and it's all local Duke stuff, they probably don't care how many times you hit the database server. But again, I'm being conservative here because I don't really understand when Google BigQuery is going to charge me, and I don't want to be charged, so there's no harm in being conservative about how many times I hit the dataset.
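For reference, a sketch of those two escape hatches, raw SQL through DBI and bigrquery's field listing; the fully qualified table names follow BigQuery's project.dataset.table convention:

```r
# Send a full SQL statement through the existing connection object.
result <- DBI::dbGetQuery(con, "
  SELECT *
  FROM `bigquery-public-data.covid19_jhu_csse.summary`
  LIMIT 10
")

# The bigrquery way to list a table's fields and their data types,
# here for the Austin bike share trips table.
bigrquery::bq_table_fields(
  "bigquery-public-data.austin_bikeshare.bikeshare_trips"
)
```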
So I have this pointer to the COVID data, to the summary table, and I can initiate this command, and it pulls back a grid table that looks like the standard tables we're used to in the tidyverse. In this case it tells me there are 101 rows, and I haven't done much processing here; this is basically one row per county in North Carolina. If you look at this query, it's querying the summary table where the state equals North Carolina and the date equals June 30th of last year, and then selecting a particular set of columns, or variables. If I assign that to an object, then every time I reference that object, it actually re-queries the database. Let me zoom back out in my RStudio so you can see the object I just created, which is right here. They're all lists; none of them have any size to them, because they're all dynamic. Every time I call this, it re-queries the database and tells me it's billing me zero, that it downloaded one page and, in this case, 101 rows. Not a lot of data, not very expensive.

If I specifically wanted to see what kind of SQL that was going to produce, I could pipe it to show_query(), and before anything even goes out to Google BigQuery, it gives me back the SQL statement it generated. That can be handy: maybe I want to do a more advanced SQL statement, and I can use this as a stub, a cheat-sheet starting point, for the SQL I'm going to modify. That's what's happening right here.

Here are a couple more variations of straight-up dplyr searches, and if I'm not mistaken, it's not until we get to here that we actually pull data back into memory and keep it there. We can run these one at a time, and you can watch how the Environment pane changes. You've seen this already; that's the glimpse() command. This is a particular search, I think the same search I've done a couple of times, and all it's doing is creating another pointer to the search. If I execute that search, it brings back data, 101 rows. If I execute that search and pipe collect() after it, then, and to make this a little more clear let me assign it to a throwaway name, say my_foo_table, when I run this next set of rows all at once, you'll see that my_foo_table, over here in my Environment pane, actually now has data in it. That data is all sitting in my local RAM. So again, this is the idea of choosing whether to do your computation remotely or locally.

I think this next text says everything I was just going to say, so it's redundant, but the difference here, and I'm going to go ahead and run this one with the collect() statement, is that in this case I'm getting all of the data, not just for June 30th, which is what's happening right here, but all of the data since April 1st of last year: all of these summary variables, but just for Durham, North Carolina. Because that table is huge (not only does it have all 100 counties for all dates from April 1st forward, it has all counties for all the data they've collected), that's an amount of data I don't want to pull into my local RAM. I'm just interested in data for Durham, North Carolina from April 1st forward, and that's what I'll collect into local RAM, so I can do further analysis on it.
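A sketch of those two queries against the summary table, the lazy June 30th pointer and the collected Durham subset. The column names here (province_state, admin2, date, confirmed, deaths) are as I recall the JHU summary table, so double-check them against the field list in your console:

```r
# Lazy query: one row per North Carolina county on a single date.
# Nothing runs until the object is printed or collected.
nc_june30 <- my_db_pointer %>%
  filter(province_state == "North Carolina", date == "2020-06-30") %>%
  select(admin2, province_state, date, confirmed, deaths)

nc_june30 %>% show_query()  # inspect the generated SQL; no query sent

# Collected query: pull the Durham-since-April subset into local RAM.
durham_since_april <- my_db_pointer %>%
  filter(province_state == "North Carolina",
         admin2 == "Durham",
         date >= "2020-04-01") %>%
  collect()
```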
So that's what's happening here: I write my dplyr query, it gets transformed into a SQL query, and then, through a kind of tidyverse magic, it brings back a local table of 330 rows, representing the number of days since April 1st of last year, with information about COVID-related issues: confirmed cases and deaths, primarily, which are what I'm interested in.

But notice that the data are not sorted; I've got May 14th followed by September 4th. This is sort of a feature of relational databases: they're not naturally sorted. Database engineers and administrators build things called indexes that make queries work more efficiently for the most common types of searches; presumably that didn't happen in this case. That's fine, because I can use my other dplyr verbs to do the sorting. I brought the data back into an object named right there, jhu covid19 durham since april, and if I want to sort it, I just pipe it to my dplyr verb, my dplyr function, called arrange(), and arrange by date.

Some of that query took place remotely, at the computational site. I think that's right, because my jhu object is remote. I don't mean to make this complicated, but one of these objects is a pointer to remote data, and one of them is local data I just collected. One of them is going to run more quickly because it's local and not a lot of data, whereas the other is remote. But if we were talking about querying huge quantities of data, the remote one may in fact be more efficient than the local one, which would have to churn through all of my local RAM if I had millions and millions of rows. There are different ways to do it.

All right, now let's get to the good stuff in terms of visualizing. The simplest thing to do is something like this, where I've got my remote query, then I do some sorting, eventually some mutating, and eventually pipe it to ggplot. This is essentially the exact example I was trying to set up at the beginning: I do my compute and data transformation remotely on Google BigQuery's cloud servers, and I visualize locally, because I pull back only the data I want to look at. This is just a count of the deaths due to COVID since April 1st, and we're playing around with the visual qualities so you can see both a linear regression and a loess smoothed regression, and the basic trends. I colored the points based on whether there was more than one death a day: low, medium, and high.

In this next code chunk I'm doing something similar; there's not a lot of difference. It's just more ggplot, so it's not super relevant, but again the visual computation is focused on my local computer, so I can do all kinds of things to make this plot look slightly different from that plot while keeping most of the data transformation up in the cloud. Here's a similar thing where I'm renaming some variables, re-sorting, and selecting just a subset of the columns rather than everything I was bringing back originally.
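A sketch of that remote-transform, local-visualize pipeline. Again, the column names are as I recall the summary table, and the per-day calculation and the low/high cut point are illustrative rather than the workshop's exact code:

```r
library(ggplot2)

my_db_pointer %>%
  filter(province_state == "North Carolina",
         admin2 == "Durham",
         date >= "2020-04-01") %>%
  select(date, deaths) %>%
  collect() %>%        # only this small slice crosses the wire
  arrange(date) %>%
  mutate(deaths_per_day = deaths - lag(deaths),        # cumulative -> daily
         level = if_else(deaths_per_day > 1, "high", "low")) %>%
  ggplot(aes(x = date, y = deaths_per_day)) +
  geom_point(aes(color = level)) +
  geom_smooth(method = "lm") +                # linear regression trend
  geom_smooth(method = "loess", se = FALSE)   # loess smooth
```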
Here's kind of a cool thing, just in case you're interested; I don't know if you've noticed this or not. When you're working in R Markdown notebooks, you can start a new code chunk, which we typically do by choosing an R code chunk, but you can also start a SQL code chunk right here. I used to say that means you have to have some kind of database server installed locally, but the truth, I found out, is that you don't have to have the relational database server installed locally. You can use that same connection object we created way back at the beginning of this script, identify what the output of the SQL query will be, and it makes a data frame; in this case I'm going to call it special df. (A sketch of such a chunk appears at the end of this passage.) So if I run this query, and I will, a new variable shows up in my environment, similar to all the others, called special df. It's the exact same result as this one, but done with a SQL code chunk rather than an R code chunk. In the same way you can also mix in other languages; you can mix in Python, if you happen to be a Python programmer as well.

All right, so here we have special df, and we're just going to visualize it. It doesn't work any differently than you'd expect: this is a scatter plot of deaths over confirmed cases, tagging two of the higher-numbered counties. All of that happened on, I think this is for June 30th. Yes, June 30th.

But now let's get into a different query of a different dataset. Aside from the COVID cases, there's also the Austin bike share dataset, which has several tables in it; one of those tables is called bikeshare_trips. Again, the way I figured that out was by browsing at the Google Cloud Platform console, not the RStudio console. I'm going to make a new connection right here, call it a connection to BigQuery bikes, then make a pointer specifically to the bike share trips table, and then I query that with my collect() statement. What this does is literally bring back all of the data, because I followed it with collect(). So, not very efficient, but it's going to be fine as long as I don't hit my terabyte limit for the month during this session.

It's taking a little longer than I expected, but I get a progress bar right there, and let's see what we brought back. Oh, it says it billed me 136.31 megabytes; I don't know if that's new or not, I haven't been paying attention to those numbers. It also tells me I downloaded about a million-two rows, 1,264,000 or so, and 127 pages of data. I'm not sure exactly how it's calculating pages, but this is the data, and you can see right here that the object showed up as a data frame; that's because I used collect(). Let me expand that out to one screen so you can get a sense of what's in here: a bunch of data from a bike sharing company that tracks all of the bikes they share. I can use the glimpse() command like I've used before. In this case glimpse() will tell me how many rows there are, because I pulled all the data down, plus the other things about the data that I want to know. Then I want to start visualizing it, and these visualizations aren't going to be any different from what you normally do, because I started by pulling all the data down with the collect() statement.
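Backing up to that SQL-chunk idea for a second: here's roughly what it looks like in the R Markdown source. The chunk options connection and output.var are the knitr SQL engine's; the query and the specialdf name are illustrative:

```{sql, connection=con, output.var="specialdf"}
-- Runs at BigQuery; the result lands in an R data frame, specialdf
SELECT *
FROM `bigquery-public-data.covid19_jhu_csse.summary`
WHERE province_state = 'North Carolina'
  AND date = '2020-06-30'
```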
So this particular visualization, and I'm not sure it's a particularly great one, is based on this link right here, an article that tries to answer the question of how far is too far to bike to work; the basic answer is 10 minutes. So I made some box plots where I split all the bike trips into under 10 minutes and over 10 minutes, labeled them, and looked at the average length of a trip on either side of 10 minutes, based on the different categories of passes you can buy from the bike share company. That's not especially interesting in the end, because since I used the collect() statement, it's nothing different from what I'd normally do if I just found the bike share data out on GitHub and pulled it all down. But it does assume your computer is powerful enough to process 1.2 million rows of data, which I know mine is; if I were using one of those VMs, or one of the cloud-based RStudios, maybe it wouldn't be.

So let's go to that last bit I mentioned earlier. Oh, let's see what I'm doing here. Oh, this is where I'm doing a join. I have this thing called bike share trips, which, I think, I must have created earlier without paying attention. Let me check its class. Yes, this is a BigQuery table, so it's nothing more than a pointer to the Austin bike share data, without collect().

Then I'm going to do a merge remotely at that data source. This is a better example of how to push your computation off to the SQL server. What's happening, if I run these rows right here, is that I'm taking the top stations where a bike started a trip: in other words, 72,000 trips started at the intersection of 21st Street and Speedway, and 31,000 started at the Convention Center at 4th Street. That's the top five percent of all the system's bike trips and the stations they started at, and all of that computation got translated into SQL and took place at the SQL server. I do the same query for the end stations, end station versus start station, and then I merge them together so I can look at a side-by-side table, and I might choose to do some kind of visualization with that information. But what I pulled back, what came over the wire, was nine rows of data. I somehow went through 1.2 million rows of data and pulled back only nine rows. That's what dbplyr is really doing. (A sketch of this summarize-and-join follows.)
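A sketch of that remote summarize-then-join. The station-name columns are as I recall the bikeshare_trips table, and the top-N cutoff here is illustrative rather than the exact five-percent rule from the workshop code:

```r
bikes_con <- dbConnect(
  bigrquery::bigquery(),
  project = "bigquery-public-data",
  dataset = "austin_bikeshare",
  billing = "your-project-id"   # placeholder: your project ID
)
trips <- tbl(bikes_con, "bikeshare_trips")

# Count trips per station, remotely; head() becomes LIMIT in the SQL.
top_starts <- trips %>% count(start_station_name, sort = TRUE) %>% head(9)
top_ends   <- trips %>% count(end_station_name,  sort = TRUE) %>% head(9)

# The join also runs at BigQuery; only the joined rows come back.
top_starts %>%
  inner_join(top_ends, by = c("start_station_name" = "end_station_name")) %>%
  collect()
```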
All right, last bit. Here's a DB connection to the New York trees dataset; this is what I started out showing you. I make that connection to new_york_trees, give it my billing project ID, and once I've made the connection, I specifically query the tree census for 2015. There were several tables there, but I just want the most recent one I could find. Then I'm going to use two new libraries, dbplot and leaflet. If you've been to some of Drew's workshops, you've had some background on leaflet; there's a recording where Drew shows people how to do visualizations with leaflet, which you use to create a dynamic map. And dbplot is a newer library in the tidyverse ecosystem that is really designed to push as much of the visualization computation as possible off onto the remote server, so you're always limiting what data you're pulling down.

But let's get a quick sense first. Well, I should run both of those library() calls in case they're not loaded. I want to see how big this dataset is, and not only how big, but what the data types are. I actually only got a count from one variable, curb location. If I go over here, I can find curb location, wherever it is; it's in here somewhere, probably too hard to see. Curb location is made up of basically two values, and I don't think there are any NAs, because no NA came back. It looks like there are about 700,000 rows in the dataset (I first said 75,000; 700,000 is what I meant to say). I only wanted to deal with this part of the dataset, so I could pull that down locally, 26,000 rows, not such a big deal, but instead I'm going to show off this function from dbplot that I mentioned.

What it does is rasterize the data. I can't exactly explain the internals of rasterizing, but it's like a histogram: it takes counts of your data and puts them into bins, and it's particularly useful for mapping, though I don't think it's only used for mapping. I'm setting the resolution here to 30 rather than the default, and it brings me back what looks sort of like a choropleth, shaded regions where the lighter regions have more trees, based on my categories (the curb location is offset from curb, and the health of the tree is good), rasterized on the longitude and latitude, which are variables in the remote dataset; you'll see that in just a second. It brings back this map which, if you're familiar with New York, is actually fairly explanatory; if you're not, it just looks like a bunch of funky little squares. If you're familiar with New York, you know that that's Staten Island, that's probably Manhattan on the left, Harlem up here, the Bronx right here, Long Island right there. We'll process this more so it becomes a better map for us. The other way to read it is that there's a high concentration of healthy off-curb trees here, and a high concentration there, and there, and there's probably no data for this section. I happen to know, from looking at the map once I'd done more work, what that region is, but I'll leave it as a cliffhanger.

We're going to take that same dataset and this time count how many rows we're dealing with, just because I'm curious, and then the other thing we're going to do is use db_compute_raster() rather than dbplot_raster(), which gives us the opportunity to visualize locally and do some more advanced visualization. db_compute_raster() doesn't bring back an image; it brings back data: longitude, latitude, and the number of instances in each one of the bins, which is what this column is. Then we're going to tweak that further. By the way, some of this code I borrowed from the person developing this package; his name is up here somewhere, Edgar Ruiz, so if you want to look at his video presentation, you can get some more information there.
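Roughly what those two dbplot calls look like. The column names (curb_loc, health, longitude, latitude) and their values are as I recall the tree_census_2015 table, so verify them against the console's field list:

```r
library(dbplot)

trees_con <- dbConnect(
  bigrquery::bigquery(),
  project = "bigquery-public-data",
  dataset = "new_york_trees",
  billing = "your-project-id"   # placeholder: your project ID
)
trees <- tbl(trees_con, "tree_census_2015")

healthy_offset <- trees %>%
  filter(curb_loc == "OffsetFromCurb", health == "Good")

# Raster plot computed remotely: bins the points into a 30x30 grid.
healthy_offset %>%
  dbplot_raster(longitude, latitude, resolution = 30)

# Same binning, but return the data: a few hundred rows, not 700k.
raster_data <- healthy_offset %>%
  db_compute_raster(longitude, latitude, resolution = 30)
```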
What we brought back looks like 364 rows of data rather than the 700,000; we did all of that compute over there and brought back just 364 rows with latitude, longitude, and a frequency count. Then we're going to use something called tidy evaluation to create a new function called size(). I won't go into this too much, but it means that, just as you can use the mutate() function or the select() function, you can create a function, here called size(), that does some calculations to help us. In this case, we're going to figure out, not the dimensions, but the coordinate corners of each one of the raster squares, computationally, so we can do all of them at once. This is more of that: computing the latitude distance and the longitude distance so that we can create a new data frame called square, or sq, which I'll go ahead and display here. So now we have this table that we'll use to visualize the data: long1, long2, lat1, and lat2 are the four corners of each square, then how big the square is going to be, and then a sort of percentage of the data, which I think is what we're going to use to color in the squares.

So if we take all that, if we take sq (there's the data frame right there) and send it to leaflet, we come back with a dynamic map that looks like this, which we can zoom in and out of. This is something Drew can tell you more about. You'll notice the squares are not colored in yet. You can start to answer that question of what is right there, where we don't have any data, but again we'll leave the cliffhanger, because we're almost done. This is good, but it's not good enough, because it doesn't give us the frequency information. So I take that same leaflet map (the bulk of it is right here), add a different background map, and do the choropleth-style fill of colors based on this max-percentage variable, right here, which defines the opacity of each square. (A sketch of this leaflet step follows at the end of this passage.) When I run that, which I don't have to because it's in my memory, I get this nice map, and I can see where the most trees are.

Now, the only downside of this particular map is that where there are a lot of trees, like up here on the Upper West Side, there are so many that I can't really drill down to find out what's under them; where there are not a lot of trees, it's easier to see. For example, I had that blank spot in the map over here, and it turns out, if you look at a different base map, it's clearly labeled as Rikers Island. There's the Rikers Island bridge, which I've never been to, and what I know is probably only through movies: Rikers Island is some kind of prison complex, a rehabilitation complex, I couldn't give you the exact definition. But I don't know whether it's the case that there are no trees there (I suppose we could switch to a satellite view) or whether they just didn't let people onto the island to do the tree counting. We could try to answer some of those questions. Just playing around with these different views, I also found this particular section really interesting: why does that area have so
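A sketch of that leaflet step, assuming sq carries corner columns I'll call lon1/lat1/lon2/lat2 plus a 0-to-1 intensity column; the real column names come from the tidy-eval code above, so adjust to match:

```r
library(leaflet)

leaflet() %>%
  addProviderTiles(providers$CartoDB.Positron) %>%  # background map
  addRectangles(
    data = sq,
    lng1 = ~lon1, lat1 = ~lat1,   # one corner of each raster square
    lng2 = ~lon2, lat2 = ~lat2,   # the opposite corner
    stroke = FALSE,
    fillColor = "darkgreen",
    fillOpacity = ~intensity      # denser bins draw more opaque
  )
```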
many off-curb trees? And there's a nature preserve right there. Anyway, all of that was done, just to point it out, by transferring only 364 of the original 700,000 rows. That's the idea of querying remote databases: doing your computation remotely but pulling the information back locally, just so you can visualize locally.

That's pretty much everything I wanted to tell you about how this works. I hope I told you the right stuff, and I hope I told you what you wanted to know. You'll get an email with a link to this feedback form, which I'm going to put in the chat right now, and I'll open it up for questions, because I know I didn't take a lot of breaths. If you have a question, I'd love to try to answer it; if you don't, that's totally fine. Looks like Johan is about to ask a question.

"Thanks, John. Hi. This might be kind of a silly question. I'm new to all the cloud stuff, so I'm not sure, but just to be clear: should we be able to run this notebook exactly as it is, or will we get an error if we try to use your credentials in the dbConnect() call?"

That's a great question. You will get an error if you use this billing value right here; that's the only thing you should have to change. Just so you can figure out what to change it to, I'll show you where it came from. By the way, let me zoom back out: I have a file in here called "your turn," which is really simplified and leaves blanks for the things you should put in yourself, rather than relying on what I put in. But to go back to this statement: this billing value is really a project ID. I'll go to my console; that billing ID came from right here. When you create a new project, it gives you an ID for your project. You can call it anything you want, like "my first project"; I think when I created my first project it actually gave me a randomly generated ID, but you can change it. You just need to use that value in this section right here, and then the whole thing should run.

If there are no other questions... go ahead, go ahead.

"Thanks. You were explaining when we're still on the server side and at what point we hop over to the client side, and you explained that the collect() function is one way to do that. But then you also did some mapping totally in the cloud, and I understand there's a little bit that needs to come over to my side to do the mapping. But, for example, I make heavy use of ggsave(). If I do everything on the cloud side and then end with ggsave(), what happens?"

Yeah, so what should happen then is that only the file you saved comes over. Well, I believe so; yes, you should get the file, and you should be able to save that file locally.

"Okay, thank you."

Yep. But also, and I don't know that I made this point super clear: this approach to querying remote databases, we just used Google BigQuery as one example. There is a lot of data to query there, but you may have access to a different relational database in the cloud, and you can generalize what we've done. I'd be happy to help you get started there; we just need to know some ODBC connection information to query other remote databases that are not part of Google BigQuery. Does that make sense?

"Yes. And would that be a different R library?"

It should be the same. I actually didn't bring up the documentation like I had hoped to, but... db... let's see, that's not it...
db.rstudio.com. We loaded this library, the DBI package, which I'm pretty certain is the package that brokers most of the ODBC connections. There are different kinds of connections to make, but with the DBI package, or some of these other packages, you can connect to Oracle servers, Microsoft SQL Server, and the list goes on. Redshift is the Amazon counterpart of Google BigQuery. I see that Microsoft SQL Server is listed here, and I don't know if that's related, but the Microsoft counterpart of Google BigQuery is a product on top of Azure. The university has access to Azure, and I've done just enough research to be able to say that you should be able to apply this approach to Azure databases as well. So if you can get on the university's backbone, you could kind of ignore all those conservative comments I made about billing, because the university is probably not going to charge you for that.

Okay. Well, I'm happy to stick around for a couple more minutes if there are other questions, but I have told you everything I had hoped to cover. And since this is the first time I've done this workshop, if you want to put anything in the feedback that would help me think about how to do it a second time, I would love to get your feedback. Thanks for your attention.