Welcome, everybody. This is Intro to R. We appreciate you coming on such a beautiful fall day. The weather has suddenly turned very nice, and we had a good sign-up, so we're heartened to see a large number of people attending. John and I have been teaching Intro to R for probably the sixth or seventh time at this point, at least. The first question we always want to ask is: how many people in the room have used R before as part of your research? A few people. Okay. How many people have used Stata, SAS, or some other stats packages? So a good mixture there as well. Okay. What we are going to do is take a look at R using, and this is in the course description, an approach that's called the tidyverse. It's going to take you through using R in a very particular way that we think is a little bit easier than the traditional approach of starting with base R and going through the basics of the language. We definitely are not going to focus on all aspects of R, but we're glad to linger after the class and answer specific questions if we're not touching on a concept or something that you're interested in hearing about. So please do ask questions during the presentation if things pop up that are unclear or you'd like further explanation. We may defer some things to the end of the class. The other thing we're going to use, which I will talk about in more detail in a second, is R Markdown, which is actually not R per se, but a way of writing documents that blends R code and English text together. We're using that mainly to make things a little bit easier to read, because all of us can read human languages. And the R code is fenced off, so you know that this is R code and that it's a little bit different from what we would say to each other in a natural human language. So that will be another aspect of the class.
This is one of about 25 to 27 workshops that the Data and Visualization Services department in the library offers each semester. We encourage you to come to others. We have one more session on R later this week, on Shiny, that I'm teaching on Thursday. We have handouts in the back, and we are always very glad to see you. If you can't make a workshop, we try to record those workshops. I'm not sure if we're recording today, but this session actually has been recorded, and you can find the recordings on our website. We can give you a link for that. The other thing I would say is that we also do consultations here in Bostock Library. Our lab is on the first floor. We have a large data visualization space with 12 workstations. Come and see us. You can always email askdata@duke.edu, and we're glad to talk to you by email or in person, whatever works best for your research situation. So please join us. Which brings us to Intro to R. As I said, we're going to learn today about using R with RStudio, and I'll talk a little more about RStudio in a minute. We're going to focus especially on loading data in for analysis, because most people won't be hand-entering all the data they're looking at. You need some quick way to get the information that you want to look at into R. And finally, we're going to focus heavily on managing, cleaning, and analyzing data because, unfortunately or fortunately depending on your predilection, data wrangling and data management is often a large portion of getting ready for an analysis or a data visualization that you want to make. We're going to try to show you some tools that make it a little bit easier and that focus more on using human language, human verbs, to do those processes and make them more intuitive. We'll also answer questions about R if things pop up along the way, and we're going to try to give some interesting previews of what you can do in R and why.
It's a little bit different from SAS, Stata, and some other traditional statistical packages. The first thing I would say is that we're going to implicitly focus on using data frames in R. A data frame is just a rectangular form of data. We won't spend a lot of time talking about it beyond that. We can't go deeply into data type concepts, so we're just going to leave it at that. We're mostly going to focus on numeric and text data. There are other data types inside of R, but we aren't going to worry so much about those today. And finally, we're not going to worry so much about missing data. We'll leave that for survey classes and other stats courses for the moment. But once again, we're happy to talk about all those things if you would like. So why R? This is a question I often hear from people using Stata and other packages: I know these packages well, so why would I consider shifting over? Besides the fact that R is a free programming language, which is a very compelling reason, R has a lot of capabilities beyond doing just base statistics. You can create data-driven maps inside of R. It's a great tool for making interactive websites, and indeed we'll be talking about that a little bit on Thursday. You can use it for text and sentiment analysis. More and more people are moving outside of numeric data and looking at other data types, and R has extensions that will let you do that type of work. One of the more interesting things is that more and more people are actually using R to do their academic writing as well as their analysis and coding. There are several packages now that let you use R as a word processor. You can use it to publish blog posts. You can use it to build websites. RStudio and R together make a fairly fully functional tool for research.
Another reason a lot of people think about using R is that the R community has been very busy building extensions to a lot of popular web services and tools. If you go on GitHub, or look on Bioconductor or several other academic code sites, you're going to find a lot of linkages to popular web tools, making R work with things like Leaflet if you want to build web-based maps. You'll find connections to archives like figshare, and connections to the Google suite of services, such as Google Sheets. You name it. There's probably a connector that can link up the work you're doing in R with another platform if you need to make those connections. That's a fairly compelling reason, if you already know R, to use that knowledge to move into these other spaces. But first, a little background about R. For people who are brand new, there is often a little bit of confusion about getting R installed on your computer and what you need to use. R and RStudio are not the same thing. R is the programming language that we're going to be talking about, and it is the basic thing that you need in order to run R code on your computer, so you do have to download that. RStudio is an integrated development environment that is built around R. You can't use RStudio without having the R programming language installed, but you can use R without RStudio, although a lot of people are increasingly using RStudio for all the nice tools that come with it. We're going to focus mostly on RStudio today. The other thing we wanted to mention is that you don't necessarily have to go through all the trouble of installing R and RStudio on your machines. OIT at Duke has several nice virtual computing systems available. You can explore that and actually just get a machine with this all pre-installed, without having to worry about installation details.
So that is definitely something we can talk a little bit about after class as well, if that's of interest. So, the R interface: what does it look like, and how do you work with the system? Basically, when you open it, there are four panes that you would normally see. At the top left-hand corner you're going to have a code editor, and this is a space where you are going to write out... Sure. "Since you mentioned that R and RStudio are different, do you use RStudio?" Yes. This is the RStudio interface that we're going to be looking at today. Generally the default has four panes available. You can move these around based on your preferences, but out of the box it will tend to look something like this. At the bottom left-hand side of the screen, you have a console where you can enter one command at a time and R will reply back, kind of like having a dialogue with the computer. If you want to enter more commands, several lines of code, and maybe some human-readable text as well, there's a code editor in the top left. It will allow you to keep entering line after line and then run everything that you've entered so far, or make selections and run blocks. There's a box at the top right that tells you a little bit about what is in memory, the environment, so you can see the variables that are loaded, the datasets, and other objects you may be working with. At the bottom right you have a set of tabs that do a lot of different things. You've got a file explorer, which isn't open at the moment. You have a tab for viewing plots, if you're doing statistical graphics or visualizations. There's a list of packages, which are add-on pieces of code that make R do things that it wouldn't do normally, and which you can add or remove depending on your preferences. There's help, which is very key because R has a lot of commands, and there's a built-in system for looking at the syntax for those commands.
What are the options that I have when I use them? And finally, there's a viewer that's kind of a multi-purpose window for different types of visualizations and interactive websites. R uses that to pipe out results that are graphical in nature a lot of the time. So it's also convenient to have the code and the output, or at least the viewer output, close by. The first thing I would say is that R tries to provide you with a lot of resources for getting help and navigating your way around both the interface and the code that you're going to be typing. Indeed, one of the things we hear most about R is that the syntax is confusing and it takes a while to learn the language. RStudio has tried to make that a little bit easier by having a Help menu at the top where you can get information, including documentation about keyboard shortcuts if you're not a person who likes to write long commands. There is a set of cheat sheets hiding inside the system that are extremely useful and that show you the most frequently used commands for different types of tasks. There's one for data wrangling, there's a sheet for the RStudio interface I was just talking about, there's one for graphics, and one for building web interfaces. It's definitely worth exploring. We'd recommend that. In general, the RStudio group has also tried to put a lot of webinars and other guides online to make it a little bit easier to use. We're going to focus a little bit now on writing code in R and the general process that we're going to follow. You have several different ways to work inside of R to write code. We're going to focus on using R Markdown today in class, which is one particular way. I'm going to break out of the PowerPoint presentation for a second and actually just go through an example to show you how this works. If I can get away from Chrome. There we go.
So this is how my display looks, and how yours will look if you open RStudio on your machine. I've got the code editor enlarged, a little too large. Actually I'm in the console, sorry. There's my code editor. I am going to start a new file. I have a lot of options, but for class today we're only going to focus on the R Markdown option, which is about the third in the File, New File list. "Do you want folks to follow along?" You can. It's up to you. You're welcome to follow along if you would like. We'll definitely go through this one more time. I'm not going to name the file. I just want to show how this works. Generally, when you open this file, RStudio is going to give you a document that is a template, or at least a preview of what an R Markdown document looks like. The general format is that things written in human language, English here, are just typed in, and you can read them just like in any word processor. You are going to see some funny codes around some of the text. These are text formatting options, which we're not going to worry about that much today, but there are ways of using special symbols, pound signs and asterisks, to make headings and to make text bold or italic. Not so important for today's class. The other thing you're going to see in this document is a lot of sections where there are these tick marks, a set of curly brackets, and then R code that follows. These are R code blocks, which we will be using quite a bit today. People sometimes find it tricky to find where the backtick key is. It's at the top left-hand corner of your keyboard. If you had to type that every time you wanted to enter R code, you would probably get tired of it. So there is a keyboard shortcut, which we probably won't focus a lot on today. But you can also go to the top right-hand side of the editor. Let me go to the very bottom of this. So I've gone down to some clear space.
Right here at the top of the screen, where it says Insert, with the green plus-C icon, R is the very first option. That will give you an R code block. For those of you following along, is that working for everybody? Okay. John can help if you have a question. There is a keyboard shortcut for that. It's Control-Alt-I on Windows, or Command-Option-I if you're a Mac user, and I tend to use the Mac more frequently. That will give you an R code block. This section right here, the brackets with the r inside, is not unique to R. You can also put in Python and other languages. This little bracket just tells the compiler: hey, we're using R here and we're not using English anymore. We're switching over to computer code. And so we have the two things together. I'm not talking a lot about the header. Don't worry about the things at the top. They're not so important for the class. The great thing about running these together is that once we're using a document like this, I can go to the right-hand side of any of these blocks, click that little green arrow, run it, and get results inside of my code editor, which is pretty neat. I can keep going through the document and looking at what the outcomes are going to be. I can also go to the button called Knit at the top. Knit wants to know where it should save this, and I'll just say, save it in a file called test. Then when I click Knit, and you can knit to HTML, PDF, or Word, it's very flexible about the output formats, which is another nice feature of R, it brings together my English-language writing and my R code in the same space. Let me just make one comment about what Joel is seeing at the top as a menu system. It's there because we're using R to also generate the website guide for this workshop, so that's being generated by something called a site YAML file. But in most cases, you won't get all those.
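To make the pieces just described concrete, here is a minimal sketch of an R Markdown file built and knit from R itself. It is only an illustration: the file name test.Rmd, the title, and the chunk contents are made up for this example, and in class you would normally type the document in the RStudio editor and click Knit rather than generate it from code.

```r
# A tiny R Markdown document: a YAML header, some plain English text,
# and one fenced R code chunk. All of the content here is illustrative.
fence <- strrep("`", 3)  # three backticks mark the start and end of a chunk

rmd_lines <- c(
  "---",
  "title: \"Test\"",
  "---",
  "",
  "Plain English text goes here, outside of any code chunk.",
  "",
  paste0(fence, "{r}"),   # the {r} tells the compiler that R code follows
  "summary(mtcars$mpg)",
  fence
)

# Save the document to a temporary folder.
rmd_path <- file.path(tempdir(), "test.Rmd")
writeLines(rmd_lines, rmd_path)

# Knitting from code does roughly what the Knit button does.
# It needs the rmarkdown package and pandoc, so it is skipped if either is missing.
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  rmarkdown::render(rmd_path, output_format = "html_document", quiet = TRUE)
}
```

Swapping "html_document" for "word_document" or "pdf_document" corresponds to the other Knit options mentioned above; PDF output additionally needs a LaTeX installation.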
That is to say, your screen probably is not showing the Intro to R part, but you should be getting the code, and the rest of it should be showing up on your display. That's the normal way it should look. So we are using R Markdown today mainly because it gives us a really nice way to talk explicitly about R, using English language around these blocks. Why should you be using R Markdown? The reason, we would argue, is that it gives you a really great way of documenting what you're doing in your code at the time that you're working on it. If a couple of weeks pass and you're coming back and looking at an analysis, you have a place, not only in code comments but outside of the code, where you can actually write down what you were doing at the time of the analysis. Or if you're working with a research team, it's a great way of passing notes to each other about how your research is going. So that's our little spiel for R Markdown today. We won't go into a lot more detail, but we'll definitely use it for most of the examples in class. That brings us to the section on files and projects in R. Data and Visualization Services does a lot of work on data management and on helping people organize their research data and work with teams, and R is actually a really great tool for helping people organize their work and share it with other people. One way that R does this really well is that RStudio has a built-in system for managing research projects. A lot of people think, well, I wouldn't use a project unless there are multiple people, but a project is often anything you're working on that you want to keep organized for yourself. So a good way to do this is to use the project controls that are at the top right-hand side of RStudio.
We're going to use them today as a way of bringing down all the class materials, so you'll get a copy of the class materials here. Later, if you want to install these on your own machine, you can just follow these steps, and all the code and the slides, everything we're doing, will be downloaded to your machine. So the first thing I would like you to do is go to a web browser and enter this address. We didn't get it linked into the Intro to R page in a way where it was quick and easy to find. This is going to take you to a GitHub repository that has all of the training materials for this class as well as the code that we're using, and you don't have to do anything other than just go to this page. We'll come back to this in a second. It should look something like this. I'll go back to the code. There we go. Success. Okay. What we're going to do is, I'm going to leave the slides and actually join you. Close this project. At the top right where it says Project, click on Project. We're going to start a new project. This class is our project. Instead of just creating a new project on our machine, we're going to pull this down off of the web from a version control repository, which is what GitHub is. So we're going to click Version Control, and we are using Git, which we teach in another class, but all you have to know today is that Git is the right choice. The repository URL they're asking for on that first line is hiding on the web page. If you go back to Chrome, or to your web browser, you'll see a green button right in the middle. If you click on that button, there's going to be a URL there which starts with https. If you click on that, it should highlight the entire thing, or just make sure it's all highlighted. Right-click with your mouse, choose Copy, and then paste that into the repository URL. I'm using right-click and Paste. My keyboard wouldn't let me Control-V for some reason.
I don't know why. So far so good. Okay. Project directory name. I'm going to call it intro to R, but whatever you would like to call it is fine. That is up to you. Whatever project name is memorable. And "create the project as a subdirectory of": I don't think we're going to worry about that either. Let's just leave that alone. Leave the tilde. We're just going to create the project. Your machine won't say this, because I did this at the start of class and it's saying, hey, you've already done this once. R is very literal, and it's saying you can't overwrite the directory you've already created. But you should see things flying by on the screen as it downloads all the code. And what you should see at the end, I'm going to go back to mine, is that under the file list at the bottom right you should now have a lot of files located in that directory. That is what we're going to be using for the instruction and the exercises. So stay tuned, and John will talk more about loading that. Before... Yes. But just to close very quickly. In closing I'd like to talk about places to go to get R help. Besides askdata@duke.edu, we'd like to recommend two sources that we found very useful as we tried to build out some of our R services. One is R for Data Science by Hadley Wickham and Garrett Grolemund. This is a book that is also available online at the link that is here, or you can just Google R for Data Science. Hadley Wickham is a luminary, not kind of a luminary, in the R field. He's written numerous packages. He's the author of many of the tidyverse packages. This is just a really clear guide to building a workflow inside of R, and to the basic techniques that people use when moving from data to analysis and using data visualization effectively. Indeed, Hadley wrote one of the premier data visualization packages for R. A second source that I tend to find quite good is Robert Kabacoff's R in Action.
He's coming a little bit more from the social sciences. He's a stats professor and social scientist, I believe, so it has a little more of a social science point of view. It's a very good reference to base R as well. Both are good guides. He has a website as well. Take a look and see what works for you. That is what I would recommend on that one. Getting help online: I'm running a little bit long, but there are multiple ways to search. If you're using Google, we would definitely recommend adding rstats or R to your search. One of the jokes about R is that it's so easy you can Google everything, but finding R materials on Google has gotten a little bit rough, because disambiguating that you mean R specifically can be a little bit tricky. There is a search engine called rseek.org. It takes care of that trouble: it only searches for R topics, or it's optimized for that. So that's pretty handy. A lot of people enjoy using Stack Overflow and adding R in brackets to the search on that site. You'll find a lot of people who've had difficulty using R for a particular task, along with code suggestions, so it's not a bad source for getting ideas for different types of coding challenges. Another source that a lot of people have found very useful is Jenny Bryan, a statistics professor at the University of British Columbia who has recently joined RStudio in an official position. Her STAT 545 class uses R as an integral part of the course, and she has a very academic approach, academic in a good way, to using R and to some of the challenges that researchers face when trying to get R to do academic tasks. So her website is another one worth looking at if you're bumping into different types of problems getting R to behave the way you need it to behave. Finally, for help at the command line, you can always type in help and then, in parentheses, a function name to get information.
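As a small sketch of the command-line help lookups, the lines below use base R's help system. The function name mean and the search phrase are arbitrary examples chosen for illustration, not something taken from the class materials.

```r
# Look up the help page for a function whose name you already know.
help("mean")

# A single question mark is shorthand for the same lookup.
?mean

# When you are not sure of the exact name, a double question mark
# (or help.search) runs a broader search of the installed documentation.
??"standard deviation"
help.search("standard deviation")
```

In RStudio the results open in the Help tab at the bottom right; in a plain R console they print as text or open in a browser, depending on your setup.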
If you're not sure exactly what the topic is, two question marks at the command line followed by the name of what you're looking for does a bit of a fuzzier search and searches a few more places. And with that I'm going to transition over to John, who's going to start talking about data wrangling inside of R. So we have several sections, sub-elements, of today's workshop where we're going to learn different parts. The first part is just about loading data. It could feel very slow, perhaps, but it'll help us all get to the same level so we can do more of the wrangling later. Just a couple of comments. Everybody who's registered, or even if you're on the wait list, as long as I have your email address, is going to get an invitation to use DataCamp for the next six months. I think it's six months; they have a funny way of defining the time frame. That's an interactive online learning site that usually charges people by the month, and they're particularly good with R and Python. Anyway, I will send you all an invitation, and it's a way for you to back up what we tell you today with practice, because R is a command-driven language, so the more you use it the more you're going to learn. So if you see that in your email, don't be surprised. I'm not going to give you any assignments on that, so it may initially look like, well, what in the world do I do here? Just go to the courses and you can take any of them. You can do the Python ones if you want to. It doesn't matter. I found that to be a really handy way to learn more about R.
The second thing to mention is Joel's slides, which have some very handy information. When you opened the Git repository, it downloaded all those slides. There should be a directory down here called slides, so you'll see that you have already downloaded his slides to your local workstation. If you do that same thing back on your laptop, or wherever you do your work, you'll get his slides and you can review them. And we mentioned that this is likely going to be recorded, so you can share all of this information, or if you're just learning, you can speed up the recording and go through it again to get to the good part that you couldn't quite remember, if that happens to you. It happens to me all the time. All right. I just wanted to reiterate that basically everything we're doing here, all of these guides, was done in R, and it was done using R Markdown, which Joel mentioned is a way to do literate coding. That means you can intersperse your English documentation with your code, which is really handy because six months down the road, you're the person you most need to talk to. Why did I do that transformation? Why did I use that particular package? What did my results mean to me when I produced them? But you can also use this literate programming method to generate all kinds of derivative outputs: just the code, or a Microsoft Word document, or a web page, or whatever. It's a good way to enforce your reproducibility, and if you have been following science news, you know that we are being bombarded with information about the crisis in science, which is that results can't be reproduced. This is a great way to document all of your work so that it can be reproduced. Right, so the first thing we're going to do here is run through this part called loading data, and I'm going to work in two environments. One is this web page, and the other is over here in R.
I'm going to start, and you can follow along, but you don't have to. We're going to get to an interactive exercise part in a minute, so you're welcome to go along with me, but you don't have to. I'm going to open this file called intro2r.Rmd, and what I'm going to do for the most part, hopefully to make it visible, is make this source pane the largest. I wouldn't recommend that you necessarily do that, but it will allow me to make things large, and you'll be able to see more of what's happening here. I'm going to adjust the appearance so the text is a little bit larger. Okay, so what's going on here? I have this YAML header, which defines how the document is going to get printed out. Technically, in that YAML header, I think the only thing that's really necessary is the title. Then there is some structure: the first section is about data management, and then there is some information about the tidyverse, because we're going to teach a particular method called the tidyverse method. Just a little bit more background. R was developed out of a statistical language called S. That was done in the '70s for sure, maybe even the '60s, and the way we use data these days is a little different from how we used to. The tidyverse is a way to put a layer of modernization on top of your R tools so that you can manipulate your data in a more logical fashion, both for your own use and for reproducibility. So the tidyverse is a concept, but it's also kind of a giant mega-package, or meta-package, or collection of packages. Every one of these words here is the name of a package used for different things. For example, lubridate is a package that you can download to manage dates. If you've ever worked with time, you know dates can get really hairy.
All of these packages have a different function, and the nice thing about remembering just the one name, tidyverse, is that you don't have to choose. One of the big questions when you start out with R is: which of those packages should I really be using, and how can I work as quickly as possible? My suggestion to you would be to use the tidyverse, and the rest of it will come to you the more you work. The tidyverse is this collection of packages. We will mostly be using readr, tibble, and dplyr, and the rest of them we probably won't even really notice. So how do you load them? Well, you use this library command and then you identify the package. Let me go back here and just real quickly show you my four-quadrant view. Joel mentioned that there's a Packages tab. When you're back at your home screen, you can click on Packages, use this Install button, and install tidyverse, and it will bring down all those packages at once. Fortunately, we've already got that done. So you install only once on any new computer. You install all of the tidyverse packages. But every time you open R after closing it, you have to tell R, even in RStudio: okay, I want to use these packages. You only have to install once, but you have to tell it every single time you open it. The way you do that is with these library commands. In this case I'm going to tell it that I want to use the tidyverse inside of this code chunk, and I also want to use an onboard collection of data sets that's called datasets. It's also a package, and we're only going to use one data set out of there. So if you do this with me, if you have intro2r.Rmd open, click on the green arrow of the code chunk for this, and one of two things will happen. Usually I'm expecting a message to drop down here, and it's just a system message saying whether or not there were errors with loading the package.
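The install-once, load-every-session pattern can be sketched as follows. This is an illustration under the assumption that the tidyverse has already been installed on the machine, as it was on the classroom workstations; the install line is commented out for that reason.

```r
# One-time setup on a new computer (left commented out so that
# running this chunk does not reinstall everything each time):
# install.packages("tidyverse")

# Every new R session: attach the packages you want to use.
library(tidyverse)  # attaches the core packages, including readr, tibble, and dplyr
library(datasets)   # built-in example data sets, including mtcars
```

The startup message that library(tidyverse) prints lists the attached packages and any function-name conflicts; it is informational rather than an error.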
But I think Joel may have already done that, so I'm not getting that system message. No worries. Scroll down a little bit farther. I've got some more text explaining what I'm doing. I'm using the mtcars data set. I can use this command right here to get the code book and more information about mtcars. So let me just go back to the four-quadrant view. I could type in question mark mtcars right here and hit Enter, and it's going to open the help page, or I can just go over to the Help tab and type mtcars. In this case it's my code book. It's also where you get information about the functions that we're going to use. Hopefully some of this will become more clear as I move along. All right. One of the things that is important for reproducibility is being able to document how you loaded in the data that you've loaded. The short answer is that you would do it like this. But I also want to note that there are many ways to load in data. I want to show you the command-line way, and I also want to show you the import wizard way, because the import wizard way requires a little less memorization. The function is read underscore csv. It's one of the tidyverse functions. Then what I'm doing is identifying the path to my data, a relative path. So once again, let me go to the four-quadrant view. Over here in my Files tab I have a data directory, and in there I have the cars.csv file. So I'm using this function, I'm identifying the path to the data, and I'm using this assignment operator, which looks like a less-than sign followed by a dash. There's a keystroke for that, which I'll demonstrate in a minute. I'm assigning the data that was read in to a data object that I'm calling cars. Okay. So if I click on this green arrow, something will show up here in this upper quadrant, if I'm in the Environment tab, and it will tell me that I loaded an object.
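Here is a minimal, self-contained sketch of that reproducible load. In class the file is data/cars.csv inside the project; because that file may not exist on your machine, this sketch first writes a stand-in CSV from the built-in mtcars data to a temporary folder. The stand-in file and the column name make_model are assumptions made for illustration.

```r
library(tidyverse)

# Build a stand-in for the class file. In the workshop project you would
# skip this step and read "data/cars.csv" directly.
csv_path <- file.path(tempdir(), "cars.csv")
mtcars %>%
  rownames_to_column("make_model") %>%  # car names become an ordinary column (hypothetical name)
  write_csv(csv_path)

# The reproducible load: read the CSV and assign the result to an object named cars.
cars <- read_csv(csv_path)

# The Environment tab reports the same thing as dim(): rows, then columns.
dim(cars)

# The code book for the underlying data set is in its help page.
# ?mtcars
```

With the class materials the load is the single line cars <- read_csv("data/cars.csv"), using a path relative to the project folder.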
The object has a name, cars, because I gave it that name, and it has 32 observations and 12 variables, and if I want to look at it I can actually click this little grid right here. I think you can all see that. And I just want to note that that is for my personal viewing. I'm now looking at the data. It tells me the same thing: there are 32 entries, I'm looking at all the columns. That's all great. But I've moved into a different tab. I was in this intro to R literate code page, the R Markdown page. This is just so I can get a quick sense of, well, what does this data look like? Alright, I'm going to go back here, and if you'll do this with me, this next code chunk is really probably not worth explaining, but there are some differences in how we deal with data structures, and what I wanted to demonstrate here is the simplicity of doing a reproducible data load, but I really want to do it this way. Let me just also mention, because I said a minute ago that the easiest way to load data is to go over here to the Environment tab and click on Import Dataset. So if I do that, let's say I want to recreate or get this bit of code. Click on Import Dataset. I know I can go hunting somewhere on my local file system, or even remotely. I have a CSV file, and if I click Browse it's the same view of the file system, inside of data. Here is the cars CSV file, and if I click Open, one of the things that's happening is it's showing me right here the actual code that I would need to transfer over into my code chunk. I mean, I can just click Import right now and it's going to happen, but the reason why I would copy that one line of code is that it's more reproducible for the future. Otherwise I have to keep doing this import every single time. Alright, but we've already done it, so I just wanted you to be aware of it. 
Alright, now, we were looking at this view just a second ago, and that's all very nice and good, but if we were producing output, we can't share this tab very easily with some other remote user. What you can do is just call the object name. That's the object name that we loaded up here in line 45. And when we run that code chunk, R will automatically, let me make this larger, R will automatically put it in a pageable table. So I've got pages one through four, and I can click next, next, next and scroll through the data. I can look at the column names. And immediately below that there's a little bit that tells me what the data type is. I'll explain that in a minute. And over here on the right-hand side, it looks like I've got 10 columns displayed out of 12. So if I click on this little arrow I can see the remaining columns. One of the values of that is, if you are generating output that you want to share with other people, that same little data window can show up in your derivative output. So I didn't do any extra coding in order for this to show up in my, in this case, HTML report. Alright, one more little bit. You don't have to memorize this at all. Joel said we're going to use primarily data frames, and that's true. There are sort of two kinds of data frames. There's the old-school data frame, and then there's the tidyverse data frame, which is generally referred to as a tibble. Think of them as the same for the time being. We're not going to use these, but you'll run across them eventually: lists and matrices. And then the other thing is a vector. The vector, really, you could think of as simply a column of values. They always have to have the same data type. So that's a character vector called make model, that's a numeric vector called mpg, and each column has to have all the same data type. So I couldn't actually put letters into this numeric data type. 
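A small sketch of the point about vectors holding a single data type. The three values are taken from the built-in mtcars data; the object names `mixed` and `small_cars` are made up for illustration. Mixing a letter string into a numeric vector does not keep it numeric, because R converts every element to character.

```r
library(tidyverse)

mpg <- c(21.0, 22.8, 21.4)                                     # numeric vector
make_model <- c("Mazda RX4", "Datsun 710", "Hornet 4 Drive")   # character vector

class(mpg)
class(make_model)

# Mixing letters into a numeric vector: every element becomes character.
mixed <- c(21.0, "unknown", 21.4)
class(mixed)

# A tibble is a data frame whose columns are vectors like these.
small_cars <- tibble(make_model = make_model, mpg = mpg)
is.data.frame(small_cars)
```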
For the most part you're not going to have to worry about that too much, especially starting out, but if you're used to doing a lot of statistical analysis I think some of that will make sense to you. It's worth noting that sometimes you want to know, well, what is the data structure that I'm working with? So we loaded the data; if I just use the class function on it and run that, it tells me right here. This is why we call it a tibble: there are no vowels there, tbl, and if you were going to try and pronounce it you'd perhaps pronounce it tibble. But it's a tibble data frame. One other useful data structure command is the glimpse command. If I run that, it tells me some other useful things. I have 32 observations, I have 12 variables, and the variables are listed here. This is the initial glimpse of the data set itself, and then in this column is the data type. So we know that the first data type is a character, and everything else is listed as a double, which is a double-precision number, a numeric data type. R basically has two numeric data types: integers, and doubles, which are floating-point approximations. You don't have to know that, you don't have to assign it in advance, R is just going to handle that. So if you're doing some kind of advanced mathematics, or semi-advanced mathematics, where it becomes an issue, you can manipulate that, but we're not going to do that today. But the main data types are character, so alphanumeric; numeric; logical, so true and false; and then factors, so you might have categories of things. In today's workshop we're only going to use numeric and character. But just to pick up: we used class to find out the type of the data structure for the cars object, and I mentioned that each one of these columns is a vector. So we can also use class as another way to identify the data type of one of those vectors. So for example, class, data frame, dollar sign, variable name is how you would do that. 
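The structure checks just described might look like this. The `cars` object here is a stand-in built from the built-in mtcars data, and the column name `make_model` is an assumption about the workshop file.

```r
library(tidyverse)

# Stand-in for the workshop's cars object (assumption: mtcars plus a
# make_model column holding the car names).
cars <- mtcars %>%
  rownames_to_column("make_model") %>%
  as_tibble()

class(cars)             # the data structure: a tibble is also a data frame
glimpse(cars)           # rows, columns, and each column's data type

class(cars$make_model)  # data type of a single column (a vector)
class(cars$mpg)
```

The dollar sign picks one column out of the data frame, so `class(cars$mpg)` asks about that one vector rather than about the whole table.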
I've got two commands stacked on top of each other, and I'm going to get two responses when I run this. So that's just telling me that make model is a character data type and mpg is numeric. Soon, I promise you, soon we're going to move beyond this, but it's helpful to know these things. Two other helpful commands: names(cars) is a real quick way to get just the column header names. They're returned in a vector, and these numbers are telling you that this is the first element, and we start counting with one, as opposed to some programming languages that start counting at zero. This is the eighth element, and that number is relative to the width of the screen; it could be anything. But you get the sense that there's 8, 9, 10, 11, 12. There are 12 elements in that vector. str, for structure, does something similar to glimpse. It's a little more old school. It's a little less lovely, particularly when you get into more advanced data structures. But they all tell you things that you might want to know. Like, there are the 32 observations, the variables, a glimpse of the data, the data type, and the variable names. Alright. So what we're going to do now is move into a hands-on portion. Joel and I can roam around. We'll take at least 5 minutes, maybe 10. What I want you to do is go back to this webpage, if you haven't opened it. And if you don't have it, let me know, or let Joel know. We're going to go to exercise one, and it will tell you everything you need to know, or it should. You can take an interactive quiz if you're done early, which is basically just a way to reiterate or confirm what you're learning. But here are the directions, and here's exercise one. It's going to tell you to open up a new R Notebook and start loading in some data and answering some questions about the data that you're loading in. After about 5 or 10 minutes, I will show you exactly how I did all that. 
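For reference while working on the exercise, the two commands mentioned just before it can be sketched like this, assuming the first one was `names()` and the second was `str()`. The `cars` object is again a stand-in built from mtcars with an assumed `make_model` column.

```r
library(tidyverse)

# Stand-in for the workshop's cars object (assumption, as before).
cars <- mtcars %>%
  rownames_to_column("make_model") %>%
  as_tibble()

names(cars)          # column names as a character vector; [1] marks element one
length(names(cars))  # how many column names there are
str(cars)            # base R structure summary, similar in spirit to glimpse()
```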
So don't worry if you get it right or wrong. It doesn't really matter. We're just trying to all get up to speed so we can manipulate our data. We'll send a quick note to Paul. There's a machine that's not working. Exactly. Well, or the new ones, because you're always trying to under, I think, given that it's not really, it's brainwashing. Let's see. Let's try this. Yeah, I think probably just, oh no. So you may have to reload some data to see if that works. So yeah, rerun the new stuff. Yeah, there we go. Good. Did you try it? Oh, sorry. Sorry. Oh, so I tried that. Don't worry about the tidyverse right now. That was my insight. We have slightly different errors too. Oh, kind of, yes. Well, what I have to say to you is this is an issue that we're all going to have. Because you're in the data. So that should be inside of quotes. Well, I noticed that, and I didn't cover it, but it shouldn't actually make a difference, because technically an equal sign is valid. Almost no one uses it, for some reasons that are not worth going into, but it's technically valid. That worked. I don't know, I'd have to look at that closer. For the time being, let's go with equals. Oh. You got it working. Okay. Yes. What happens when you're going to say, oh, I'll do that. Yep, that's it. You got it. I made that clock go off early, so don't feel bad if you want to keep working. I mentioned we would go up to 10 minutes, unless people want to go longer than 10. I'm not hearing anything. Do people want to stop now and have me go through what we just did? I'm getting a couple of nods on that. Okay. I'll just remind you that you can do all this stuff back at your desk if you want to take time to think through what's going on. Right, so exercise 1 says, from the File menu, create a new notebook, so I'm going to do that. File, New File, R Notebook. Let me make this all full size so it's big. So down here at line 18, I'm going to put in a Ctrl I. 
Alt Ctrl I, I think, is, I don't actually even know the command, because I do it so often it just comes out of me. There it is, right here: Ctrl Alt I. It actually comes documented every time you open up a new notebook. And then load the library, tidyverse. So I'm going to do that simply by saying library, tidyverse, and I'm going to scroll down here and hit tab so I can autocomplete, so I don't have to type it all, and I'll click that green arrow, and tidyverse is running, so this script knows about tidyverse, which is good. Let me undo that, and let me try and move that up a little higher so you guys can see what I'm doing. Alright, ready to make a new code chunk, and we're going to do it just like we did a second ago. So the first question is, based on what you've seen in class, can you load the Brodhead Center data into an object called brodhead? So I mentioned there are a number of ways to do this. Let me show you what I think is the most direct. I create a code chunk for myself. This is what I do more often than not, because I find the read-in command to be a little verbose and hard to write properly. So I usually open a code chunk and go over here to Import Dataset, and I'm going to say From CSV, and I'm going to browse to the data, which in this case is Brodhead Center. And the only part of this that's not right at the moment is that the directions say to put it in an object called brodhead, and it gave me a different name. So I'm going to copy just that one line. 
I could just go ahead and click Import, but again, just for the sake of reproducibility, if I paste this over to here, this should work just fine. Perhaps it's stylistic, but I would feel a bit better if we took out as much of the hard path as possible, to have it be as relative as possible, but it would have worked either way. And when I click on this code chunk, I should, over here in my Environment tab, get a new object called brodhead, and it will actually tell me several things that I could use to answer the other questions, but I'll show you the commands as well. But here I have the brodhead object. It has 59 observations, it has 7 variables. If I click on that grid I can view that data, because it opens up a new view, and if I expand this little blue triangle, it gives me some information that I could also get if I typed some other commands, like the data types and variable names and things. Alright, so: take a look at the structure of the brodhead object. How many observations, or rows, are there? So again, that information is right here, but an easy way to do that would be to type glimpse, I hit tab to autocomplete, and then the name of the object that I had just loaded: glimpse, brodhead. And when I run that, there is my answer: 59 observations, 7 variables. So, how many observations, how many variables, how many of those variables are numeric? Now, I could go through some really long process to count up the number of columns that are numeric, but it's pretty simple to eyeball this since it's right here in front of us. There are two numeric data types; everything else above that, as you can see, is character. Okay, so now we have two data objects loaded. They're both tibbles, or data frames, and each one of those is composed of a set of columns, or vectors. Yes, sir? Factor? Oh, so you probably did, almost certainly, up here where you read in the data, do you have read.csv instead of read underscore csv? So that is to go back to discussing the tidyverse and mentioning that the way we manipulate data has changed a little
bit over 30 or 40 years. Back in the old days, what we might call base R assumed that any time you pulled in string data, you wanted it by default to be factor data. Okay. If you're used to working with factors, this probably all makes sense to you, and if you've never worked with factors, you're probably wondering what I'm talking about. Mostly, as we move forward, most people are not all that interested in factors, so the tidyverse way to read in data turns that off by default. read.csv, which is the base R way, by default has it on. And if you're wondering what I just said, you can ignore it, because you're probably not going to use factors anytime soon. And if you want to know more about that, I'd be happy to talk to you later. I have open office hours every Wednesday, one to three, in the data lab on the first floor of Bostock, and you can just walk in. Alright, so let's move into data wrangling. I'm going to go back to my intro to R screen and demonstrate a little bit more. So one part of the tidyverse is this wonderful library called dplyr, or some people say dplyr-R. I don't know how you pronounce it. I call it d-plier. I'm sure there's not a proper way. But it's this wonderful package that is designed solely for wrangling data into different shapes. Sometimes you want to split columns apart, sometimes you want to bring them together. This is common stuff that happens all the time, and if you can't do it, you can't move forward, so we're going to talk about it. I have it under data wrangling, and dplyr, as part of the tidyverse, is basically a set of roughly five English-language-like verbs that allow you to do certain things with your data. So the first thing we would do is load the library tidyverse, and then let's talk about arrange. arrange is a way to sort your data. So first, let's look at our cars data over here in this grid for a second. The nice thing about this grid is that I can
click any one of these header names and sort by it, and that's a very familiar function to you if you've used the internet at all. I can sort all my data by miles per gallon, and if I click it again it's in descending order versus ascending order. The downside of this is that this particular grid is not easy to share with other people. There's no command I can write that says sort this table in that grid. But there are commands where you can force the sorting, and in dplyr that command is called arrange. And the other nice thing is that with arrange you can sort, I'm sorry, I got a little bit lost here. In this grid I can only sort by miles per gallon, so if miles per gallon are equal, for example down here in this line there are two cars that get 21.4 miles per gallon, I can't sort those. With the arrange command I can sort to my heart's delight, and you'll see I'm doing it here. So, the first thing: let me mention another reason to use tidyverse, which is this. This is a pipe character (%>%), and we will use it more and more. You can use a keystroke to make it appear, so you don't have to type percent, greater than, percent. If you think of this in terms of English, this is basically a way to chain your whole process together. So first I want to call my cars object, then I want to arrange it, and you'll see that the more we do this, we'll start chaining on more commands, which is fairly common in the Unix world, to use a pipe and chain other commands, and it's very handy for reproducibility. So if you translate that in your head, you can think of that symbol as the English words "and then". I'll start with my cars object, and then arrange it: arrange it first by cylinder, then subarrange by miles per gallon in descending order, because the default is ascending, and then, if there's a tie, subarrange by horsepower, or hp, in descending order. So when I click this code chunk you can see it in action. The first arrangement is by cylinder, so now I have everything arranged by cylinder. I
don't know, but if I scroll through this, next, next, next, you'll see it goes from four cylinder to six cylinder to eight cylinder. And then you'll see where there are conflicts, or maybe I should say equalities rather than conflicts. So in this case, the 30.4 and 30.4 miles per gallon: the cylinders are all the same, they're equivalent, then it's subarranged by miles per gallon in descending order, so I have the highest miles per gallon four-cylinder cars first, and then where there is equality I'm subarranging by horsepower. So I can go over here and you'll see that the 113 horsepower car is listed before the 52. Okay, so that's the first verb. Very handy, basically sorting, and the function is called arrange. I slipped in a helper function, desc for descending, because you can see that it makes it more useful. Yes, sir? It doesn't change it. I guess it's for display only, I think, is the answer. So it will work when you create a Word or PDF or HTML derivative, but it should not have any effect on the base table. Yes, that's a great point. At the moment we have not actually changed cars at all, but if we wanted to keep this particular sort order, we could add in an assignment. We could call it sorted cars, sorry about the all caps, and put in my assignment operator, and then I would have two tables that are basically identical, the only difference being, in this case, how it's sorted. I'm going to take that out, but I should actually skip this step on the slides. How many people are brand new to R, once again, in the room? Okay, let's slip in one point. Yeah, going back to what you were showing, I just wanted to mention one thing. So John was talking about how you assign values to variables inside of R. So you want to save something, and it looks a little bit strange, that assignment. sorted, I think, is what John was using. If you use R a lot this doesn't seem weird, but if you are coming from any other language, it's like, what is that? There is a
convention in R to use an arrow, a less-than sign with a dash behind it, when you're assigning values to variables, and so that's just an R thing. You'll see it in code quite a bit, or in other people's code quite a bit. There is a keyboard shortcut for that, so you don't have to type both every time, but that's what's going on with that. Most of the things that are going, sorry, if you're piping, unless you're making an assignment, it's all happening in memory. That is worth noting. So when you go back later to this cars data frame, all these changes that come after the pipes are not preserved inside of cars; they're just coming right to the output, which is great for exploratory analysis, but if you need to go back, or have that calculation for later, you need to save it out. I think the other thing I would say about R, which I think we're going to come to in a second, is: this is assignment, and, I will point this out or John will, if you are testing to see if something is equal, it's two equal signs inside of R, which is true of a lot of other programming languages as well. So keeping those two things distinct will help you: either assignment or equality. And I'll let John go back. I don't want to descend too far into that, but one reason why that's important is that there are actually five different assignment operators, and one of them, which is out of favor, is the equal sign. I could do this, but convention would suggest that we do it this way. There is some directionality to this. You don't really need to know that there are five, but you do need to know that, generally speaking, the vast majority, maybe almost everybody except for a few people who are unconventional, would use this assignment method, so I encourage you to use that same method. You will see later that we do use equal in one particular context, and then there's the double equal. Alright, so that was the arrange verb, which you'll see we have preserved, and let's introduce another verb, which is select. select is the verb to identify what columns we might want to
keep. So here we arranged the whole table by three variables. Maybe we only wanted to keep some of those variables, and so we can use that select verb and identify the variables we want to keep. In this case we're keeping four, so when I run that command I get the smaller table, something that I could present more easily in the case where maybe nobody was interested in displacement. Okay, now, so we've done two things: we've done arranging by a variable, and we have done selecting of variables. We can also select rows, in which case the command is called filter. Notice that in the cylinder variable we have 32 rows, and as we scroll through it we have some six-cylinder values and some eight-cylinder values. If we only wanted to look at the six-cylinder values, we would use that double equal sign that Joel just mentioned. That's the variable name, filter is the function, we started out by calling the cars object, and then we assigned it. You're sort of starting in the middle, mostly reading to the right, and then you kind of come back and preserve that. And then you'll notice down here I'm calling that object that I just assigned, mostly because I want to see it. Okay, notice three things. I'm going to run different parts of this code, and you can see what's highlighted. If I just run this part, it displays it but it doesn't save it. If I add the assignment, it will assign it to something called six cyls, but it doesn't display it. So if I simply add a call to that object that I just created, then I have my seven rows of cars that all have six cylinders. Okay. The next one is a very, very, very handy command called mutate, and mutate is essentially generating a new column, or a new variable. And so what I want to do in this case is create a new variable. I'm not, by the way, super knowledgeable about cars, so I created this mathematical method here that I don't know that it
has any meaning whatsoever. Okay: I took the displacement variable and I divided displacement by weight, weight in, I think, thousands of pounds, right? So I'm going to call that disp weight, and the convention here, you really don't have to remember this, is to use an equal sign. So disp divided by wt is going to be a new column called disp weight. That's all done with this function called mutate, and then I'm going to pipe that and select the same variables I had selected before, as well as the one that I just created, disp weight. Right, so if I run this, the only difference between this and the one before is that I have five variables as opposed to, oh, this one has all the variables, sorry. So I have my mutated column right there. Now the last function is probably the most useful, but the easiest way to describe it is actually to show you a shortcut command first, because it's a little more concise. So I have my six cyls data frame and I want to count. Notice, if I look at my data right here, I have several rows that are 110 horsepower and I have several that are 123, so I have some repeating values and I just want to count that, like how many rows of each do I have. I could use the count command after I pipe out from six cyls, and I'm counting the hp variable, and so when I run that I get this nice little table of the values in hp and the frequency of those. So that's so easy to see that it makes it easier to explain that, in the more general tidyverse way, you do it a little bit more verbosely by using the group by command first. And the reason why is, first you want to identify the group that you want to work on, and then with summarize you can generate all kinds of other functions. So you'll see that I'm running the mean function, the minimum function and the max function, as well as the count function, and each one of those I'm giving a variable name, so that after I group by horsepower, then I'm generating my
mathematical summaries and count. Basically, when we go up here and I show you what I described first, as far as I know you can't generate multiple columns. Actually, I've never tried, but I do know that this is the official tidyverse way, and it has the added benefit of being able to operate with multiple functions. So we've kind of been putting this all together all along, but just to put it under the heading of putting it all together: in this particular chunk I'm going to call cars, and I'm going to filter. This statement right here is equivalent to saying give me all the four and six, sorry, yeah, all the four and six cylinder cars, all that are greater than or equal to four and less than or equal to six. Then I'm selecting certain variable names: make model, as well as, here's a little shortcut, mpg through wt. Right, so what does that mean? If I go back up here, here's mpg all the way through to wt, so it saves me a little step of typing. And then I'm going to add in my mathematical calculation and then do my sorting, and I have this nice table here. So having done that, I want to impress upon you, and Joel, I've often heard you say this, I hope I'm not stealing your thought, I've heard him say: you might be wondering why you would do any of this when Excel is so easy. And it is sort of true that Excel is so easy, but the real answer that I think Joel and I would give is that Excel is so far from being a reproducible program, and it's so easy to get lost in all of these crazy formulas that are not only hard to write but nearly impossible to preserve. And so while this might feel a little verbose at the moment, the more you work with R, the more you find that this, as an ecosystem, is both more reproducible and becomes a language that, the more you work with it, won't seem so hard to deal with. So let me try and suggest to you not to get too depressed or, I don't know, frustrated with R at the moment, and we'll move on here now to exercise 2. Again, there's an interactive quiz if you
want to take it. The whole point of exercise 2 is to practice with these 5 functions, and I would recommend that you do this work in the same R notebook that you used in exercise 1, so you don't necessarily need to create a new notebook. And there are answers down at the bottom. So I'm going to set my alarm again for 7 minutes, and then we'll figure out what we're going to do next. No, absolutely, take a break if you want to, and I'll be right with you. I also want to mention that I have some feedback forms here that I'll be handing out towards the end, but if you for some reason have to go, we love getting feedback. We do this workshop and sometimes we may forget something, so please give us any kind of feedback that you like. So what that's doing is it's grouping by the hp so that you can then do other summaries. If you scroll up, you'll notice here there are 3 rows, and then we want to summarize with the mean of those 3 values. So it feels perhaps a little verbose at that point; it's all with the intention of being very clear about the process. So in your, I think it's, click on untitled, right? But the grid is like, click on untitled. Oh, I'm going to release. Oh, you deleted it. I probably shouldn't mention that, but it's okay, because you can do this again just by reloading, and that's it. These questions all have to do with whether you're comfortable with these verbs. So we're not going to actually plan on ending there. People can go whenever they want, but we've got 2 more segments: 1 on visualizations, so just basic charting, and 1 on mapping. And then we'll, I did, by the way, get a different count of heads in the room versus people who signed in, so if you happen to be one of the people who didn't sign in on the sheet, please do, because we really like to get that information. So okay, let's take a pause. How many people want to extend for 5 minutes? A couple. Just keep on working there. So you're not, so, well, right, and so it's like, I expect it's going to happen one way. I'm looking at it. It's just, you're not
too, I'm older than you, and I was not using R two years ago, and at this point I'm just going to say R is better than sliced bread. And something that's interesting about that statement is, you may not know this, sliced bread, because it's so good, and this is what I was interested in as soon as I figured out how to put a dummy on that one. Tim with her. I'm a sad person. Yeah, yeah. I mean, there's nothing, you know, but we have an R person on our team, so we're trying to. I'm not by nature, it may not seem like it, but I'm not by nature very much of an advocate. I believe in using tools that work, right? But that's different when it's use tools that work for you versus use tools that work for the group. And one of the things about R is it's very popular, for a lot of good reasons, right? So we're having to learn it enough to see how they code and understand it, and not look at it going, what? And you'll get there. I mean, it's just like, you're a SAS person, or SAS SQL, and see, you can do SQL inside of R. I promise, I don't know this for sure, but I think it's a great bet, a high probability of being accurate, that the first time you looked at SAS you felt the same way. So it's a slightly different ecosystem. It's so extensible that, it's not like you can get away with never using anything else, but you can start eliminating things. I mean, literally, I make my slide presentations in R Markdown, I'm making websites in R Markdown, Word documents, which are documenting, and it works. That's what I mean by extensible: you can start eliminating all kinds of other things that previously you had to know just to get your work done. You're still using them, but you're using them with less effort. But that's not going to happen overnight. This website may serve as a cheat sheet, but make your own. Yep, not the wrong way. Next thing I know I'll probably call you and say, hey. Oh, I know that I specify the format. This is on the website, yes. Yeah, that's a good one. It's a little hard to do. Yeah, exactly right. I'm glad you brought
that up. My obnoxious little timer is going to go off in about a minute. Guys, how are you feeling? Should we keep going or should we go to the next thing? I'm reading that as we should go to the next thing. Alright, so one question somebody asked, which is definitely one that I hadn't covered, because we're doing some work here with only filtering on numbers, and numbers are a little easier to filter on: if, for example, you want to find text, you could do that, like find me all of the rows where make model equals Honda Civic. We could actually add that on here: filter, make model, and then we're going to use the double equal sign, and then we have to put the text inside of quotes, different from numbers, and that will also work. We should probably only get one result out of this, one row. It's at least worth pointing out that filtering on alphanumeric data is a little bit more complicated than filtering on numeric data, but you could add your other filters to it, because Honda Civic is distinct. Let me comment out my first filter and we'll see if I do get, well, there's only one Honda Civic in there. If I filtered on just Honda I would probably get more results, but let me just point out again, because R is coming out of the statistical world, filtering on alphanumeric data becomes more complicated, and I need to use a different set of functions that I don't want to muddy your brains up with right now, because it's difficult. So is it just like, filtering is trying to pick up rows and selecting is trying to pick up columns? Exactly right: filtering is rows and selecting is columns. Okay, so we have about twenty minutes left. I don't think I have more exercises ready, but let me just double check. Yeah, just wanted to, but this is handy information to have. So, visualizing. We're going to be very simple here. Let me note that I'm going to use this package called ggvis, which stands
for visualizing grammar of graphics. There is a more expansive package for visualizing called ggplot2, and we do a workshop on that, and we have video... I'm certain we have the training materials. This room's video system was broken, so I'm not sure if we have video on that, but we have all the training materials. It's easier, I think, to learn ggplot2 if you first stick with the syntax that we've already learned, and then you'll get a sense of how you can do visualizations, and then you can more easily shift into ggplot2. Some people argue you can never really get away from both, but you could probably use only ggplot2 for a really long time. Again, we're going to use ggvis because it is so closely related to the syntax we're already using, and I don't need to bother you with this other syntax, which is not vastly different, but you've learned a lot today. So we're just going to do some bar graphs and histograms and stuff like that. Let me see where that comes in.
So in this case I'm going to load several packages: the tidyverse, of course, and ggvis. I'll note that ggplot2 is actually loaded by default inside of the tidyverse, but I want to use ggvis. I'm bringing up that data set again, and I'm also going to bring up, actually I'm not certain if I'm using this right now, leaflet, for map making. We'll get there in a sec. So I'm going to run all of those. I'm going to read in two data files: one from an onboard data set that we've been using already, cars, just to make sure that we're starting fresh, and the other I'm reading off of the internet, which is some data on the latitude and longitude of all of the Starbucks in the United States, at least the continental United States, from about 2012. So the data is probably a little bit out of date, but it gives us some stuff that we can plot. So I'm going to run both of those, and I now have two more objects, or maybe one more object, up here in my environment, which is Starbucks, which has 10,000 observations or rows, plus the cars. All right, so sticking straight with
the maybe most simple visualizations first: if I wanted to do a scatter plot, I have that cars object that I've been working with, and I'm just going to send it to ggvis and plot my x and y. So my x should be weight and my y miles per gallon, and then I want to layer points, which is just saying "show me all of the points on a grid." So I'm just going to run these three lines first so you can see what happens, and it draws this nice grid and gives me my x-y points. All right. For people who are used to doing things more advanced than this, which would not be surprising if you've done a lot of statistical work, I can run this through a regression, a linear model, where I'm identifying the model, and also get a confidence interval just by adding in this third line. I now have my regression line and my confidence interval. Okay. You can do something very similar to this, with a few more characters, in ggplot2. So that's just straight x-y plotting, which a lot of people like to do.
Now maybe I want to do a bar graph. I've been trying to gracefully skirt the whole idea of dealing with factor data, but you can't get away from it for too long when you're wanting to do something like a bar graph. What I want to do here is create a bar graph of the number of four-cylinder, six-cylinder, and eight-cylinder cars, and if I want pretty labels, the easiest way to do that is to turn the numeric four, the numeric six, and the numeric eight into factors so that they have text labels, so I can display those text labels more simply. That's all that's happening right here: I'm identifying my cylinders as a factor of cylinders with labels. Okay, and when I run that you'll see there are my pretty labels, four, six, and eight, and that's just simply a straight-up frequency of the four-, six-, and eight-cylinder vehicles.
Yes? So that's a good question. With ggplot, yeah, I think it is mostly convention. I'm not sure that it absolutely has to be, but it would be hard to get out of it, is what I think. Oh,
you have an answer? Mostly because it's easy to interpret, to read the code about what's happening first and next, and it also saves you some typing, in that if you're using pipes, the data frame or data set is here at the very beginning, it gets sent through those pipes, and you don't have to keep specifying it. You can start talking about the variables in the data set without referencing the data set, which is kind of nice, continuing inside the cars data set. And you read it from left to right, which most people are used to. The conventional R way is doing things in brackets and then having things expand out, which is perfectly legitimate but is also somewhat confusing when you look at the code to see what happens first and what happens second. It kind of radiates out from the middle. And so this format makes it a little bit easier for a second person to read through the code and see what's going on. The code we're using relies on data frames as the primary type, and pipes work really well with data frames as far as passing them through the sequence. But you don't have to use pipes, is what I would say. I guess I was trying to figure out whether your question is "do you have to use pipes with ggvis" or "do you have to use pipes in R," and I'm not 100% certain whether you can use ggvis without pipes. You can program in R in a base R method where there are no pipes at all, so in that sense the pipes are stylistic, leaning into this concept called literate coding, where you're describing your code more holistically for others, and including yourself as others.
Okay, so that's a bar graph. Next is a histogram, which in this case is going to be very similar, which is just a frequency of the cars with certain miles per gallon. So the only difference is I'm using this layer histogram as opposed to layer bar. Okay, they're going to look almost the same, at least stylistically, right? So I know that I have, looks like, 15 cars that get, I'm sorry, 5 cars that
get 15 miles per gallon and 5 cars that get 21 miles per gallon, and I can further manipulate those labels and the axes and all that kind of stuff. I might also mention at this point that this onboard data set comes from the 1970s, from Motor Trend magazine, and so car mileage has gotten a little better since then.
So here it starts to get perhaps a little more interesting. We were plotting x-y before, but if we wanted to start grouping by cylinders, and let me just run this, you'll see that what I'm doing is I'm first grouping by the cylinders, and then I'm identifying the fill color by that same grouping as a factor, and layering my weight and miles per gallon, as well as using this little feature right here that says stroke equals black. Stroke just means that each one of these little circles, it doesn't show up so perfectly right here, but if I had converted it to a different output like a PDF or whatever you could see it more clearly, each one of those little circles is outlined by a black line. So that's what stroke means. One thing I would say about this is that it's not really clear that I need the line graph effect here. I'm just trying to demonstrate some of the features, but for this particular data, why I'm drawing a line, I don't know. I'm just doing it so you can see that it can be done. But what is a nice feature is that it's easy to highlight my groups by colors, right, and that's done right here. So I can more easily tease out that cars with different numbers of cylinders have very clear miles-per-gallon and weight characteristics.
And then if we wanted to get a little more advanced, you'll see here that I've done away with the line graph connecting the points. I've put in my regression line, that same technique that I used earlier. The other thing I'm doing is I'm expanding the size of the point based on a characteristic, right here, size, and the characteristic is horsepower. So I'm adding some information to my visualization, where the larger the point, you know that the car's
horsepower is higher. And again, visually you can more quickly tell that larger-horsepower cars, at least in the 70s, tended to have lower gas mileage, right? Higher-horsepower cars, I should probably say.
Lastly, I want to demonstrate this box plot, because I love box plots. I think they visually impart a lot of information in a short amount of space. If you haven't seen them, what's going on here is, so I'm box plotting on cylinders as a factor with miles per gallon, so this is four cylinders, six cylinders, eight cylinders. The horizontal line is the mean, should be the mean, it might be the median actually, I'm not 100% certain at the moment, of that set. And then the first box and the second box make up the middle 50% of the population of cars that are four cylinders, and then the lines represent, moving out, the lower and upper 25%, as well as the dots; those are outliers beyond even that. So if you haven't used box plots and you have data where that's relevant, you might want to look into them. I find them to be a really nice way to impart lots of information.
Looking at my time, I've got about 10 minutes here, so let me just show you, I think I might show you in this case over here, the mapping. Just one more way of visualizing data. So here I'm calling my tidyverse and leaflet libraries. I'm reading in my Starbucks data, putting it into an object called Starbucks. The very first thing I'm going to do here is filter where state equals North Carolina, because I don't want to look at 10,000 rows of Starbucks plots, and I think I end up with something like, I don't know, 300 points out of this. And then the easiest and most interactive mapping tool you can use in R is the package called leaflet. So I'm calling the leaflet command, and I'm first adding in my base map, which is coming from, I'm pretty certain, OpenStreetMap, so that has a nice set of features like road names and rivers and things
like that. Those things you can manipulate. I'm giving it a center point with latitude and longitude and setting a zoom level, so in this case I'm setting it to be a view of the Triangle. And then lastly I'm giving it my data set, which is that, oops, which is that data set I created right there, just North Carolina Starbucks. I'm identifying the latitude and longitude that are in my data frame, and I'm giving it a pop-up, so a Starbucks store name for every one of these. And so, very easily, I've created an interactive map. Every one of those pop-ups generates the store name, and I have control over what gets generated. I can zoom out. And so that's a really nice, sort of open source way to share some mapping information. You can do other things, like add polygons, make choropleths, et cetera. The more you're into GIS, the more valuable this will be to you. But I wanted you to come away seeing this. Remember when I mentioned, right around exercise 2, that you might be wondering "why would I do any of this if I know how Excel works?" I just want to suggest that there are some reproducible ways, beyond just the data wrangling, that you might want to do this.
That being said, I don't have any more exercises, so I'm going to hand out these feedback forms. You're not required to fill them out, but I'm happy to get the feedback. I appreciate you coming. I want to remind you that my department has walk-in hours for questions every weekday, Monday through Friday, up in the database lab, which is on the first floor of Bostock. There are lots of other workshops. You can send us questions. Almost all of us will schedule one-on-one consults in addition to walk-in hours. I hope this was useful to you, and if you have questions, you don't have to fill out the feedback form. I don't know that I gave enough time for interactive questions, but feel free to ask me now, or go back to your desk and try to apply this to your particular project. That's when I get really
interested: how does this relate to your particular project? I can't say in this workshop setting, but I am more than happy to sit down with you and try to direct your learning a little bit more specifically.
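For anyone going back to their desk to try this, here is a minimal sketch of the text filter, the rows-versus-columns point, and the cylinder factor discussed during the session. It is not the workshop's actual code. It assumes the cars object resembles R's built-in mtcars data, and the column name make_model and the label text are hypothetical choices made for illustration.

```r
# Illustrative recap only; assumes the workshop's cars data resembles built-in mtcars.
library(dplyr)

# mtcars keeps car names as row names, so copy them into a regular column.
# The name make_model is an assumption, not the workshop's column name.
cars <- mtcars %>%
  mutate(make_model = rownames(mtcars))

# Text values go inside quotes, numbers do not; == tests equality.
civic <- cars %>%
  filter(make_model == "Honda Civic")

# filter() picks rows, select() picks columns.
civic_mpg <- civic %>%
  select(make_model, mpg, cyl)

# Turn the numeric cylinder values into a factor with text labels,
# then count how many cars fall into each group.
cyl_counts <- cars %>%
  mutate(cyl = factor(cyl, labels = c("4 cylinders", "6 cylinders", "8 cylinders"))) %>%
  count(cyl)

print(civic_mpg)
print(cyl_counts)
```

The counts in cyl_counts are the same frequencies a bar graph of cylinders would display, so this is a quick way to check a chart against the underlying numbers.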