I would like to take a moment to honor the land in Durham, North Carolina. Duke University sits on the ancestral land of the Shakori, the Eno, and the Catawba people. This institution of higher education is built on land stolen from those peoples. These tribes were here before the colonizers arrived. Additionally, this land has borne witness to over 400 years of the enslavement, torture, and systematic mistreatment of African people and their descendants. Recognizing this history is an honest attempt to break out beyond persistent patterns of colonization and to rewrite the erasure of Indigenous and Black peoples. There is value in acknowledging the history of our occupied spaces and places. I hope we can glimpse an understanding of these histories by recognizing the origins of our collective journeys. That's, in my mind and I think for many people, a very serious issue, and it feels like an abrupt transition to move then into ggplot, but I appreciate you taking a moment to think about that with me. So what I do in the pre-survey is ask people where they're coming from, because this is an open workshop and I have no way of knowing what people are bringing to the table. If there are ways for me to skip parts, I will do that. But I'll share some of these slides with you, and partly why I'm sharing them is that I'm trying to model some of the things you can do with R, RStudio, and a reproducible workflow. Everything I'm presenting here today is done in R with various R tools. One exception is the registration system, which is run by a third-party company called Springshare, but the data gets pulled out of it. And the survey that I sent you, of course, was a Google survey. A really cool thing: you can use an R package to read Google Sheets online and import that data as a spreadsheet, and that's how I made some of these slides.
I try to do a lot of this work in R because of one of the ways R, I think, distinguishes itself from Python and other programming languages: it's a data-first programming language. It's all about the analysis. It is a fully Turing-complete programming language, so pretty much anything you can do with R you can do with Python. There are some minor exceptions, but if you're into data analytics, if you're trying to analyze things, it's a great tool, and it has a very complete, mature sense of the existing data lifecycle. That means everything from import to reports, and eventually putting things into repositories. Much of that you can orchestrate and automate through R. I mention all that because being able to control all of it in one pipeline is, in part, what we mean when we talk about reproducibility. And so I am very bullish on the tidyverse and reproducibility. As an employee of the library, you can imagine that reproducibility is important to us. It's one of the reasons we get involved in this, but it's not just important to the library. It's important when you're trying to get grants. The National Science Foundation and many other foundations are starting to require that you deposit things into repositories. They want to know that your work is reproducible, and this is a tool that helps with all that. I won't spend a whole lot of time talking about reproducibility today. We did more of that in part one, and I can go into detail if anybody wants more. I just want to highlight that it's important. So this first slide tells me who's in the room by your quote-unquote academic status. I think this is the first time I've seen the Nicholas School be the top ranked by attendance. I mean, it's not a race, it's not a competition, but go, Nicholas School. And these are really the top nine schools: a lot of health, a lot of bio. Great to see you all here.
And then the "other" category is just a conglomeration of many more onesies and twosies. As I'm looking at this slide, I'm thinking to myself, oh my gosh, I really shouldn't have tick marks at 2.5 and 7.5 when we're talking about people, because no one is half a person. But anyway, I can fix that, and we can talk about it if anybody is curious about those kinds of things. This is a time series slide. Again, all these things were done in R, and this one was also done in ggplot. It just tells me how many people responded to the pre-survey and on what days. Time series is often done in particular disciplines, maybe not the one you're in, but it's nice that you can do things like this. This is what I really want to know: what kind of background are you showing up with? I would say something like one-third to maybe two-thirds of you are not doing a whole lot of coding, and that's okay. Maybe a comment there is that I will hopefully talk a lot about ggplot today, and ggplot is a command-driven programming grammar for making visualizations, which I think of as coding. It's possible that you didn't interpret my questions the way I intended them. But just understand that that's where we're going. We'll be very tidyverse focused, we'll be very ggplot focused, and we'll be making visualizations. I asked people if they had used visualization tools. Most of you, the way I read this, are not doing a whole lot of that. That's totally fine. You're in the right place. I always like to ask about version control and shell command-line interfaces. I think of the R programming language, Python, coding languages in general, as kind of command-line interfaces anyway. But again, in my mind these all lead towards foundational aspects of reproducibility. So I like to get a sense of where people are, because I think reproducibility will continue to be an issue that people are faced with needing to learn more about.
And I just like to know where we are at the moment. Hey, John, sorry, there's a quick question in the chat about how R compares to something like Stata in terms of writing code. Yeah. Let me see if I can get to mine. Maybe I can. I don't know if I need to read that one in particular, but somebody said they might leave early. That's totally fine. I'm not seeing the one about Stata, but let me just try to answer that. Stata is a statistical analysis application. You can certainly write scripts that create a linear flow of the analysis you're doing, and you can use if-then logic. But as far as I understand it, it's not a full-on programming language. It may excel in some kinds of statistical analysis over what R is doing. I don't know it at a detailed enough level to say, oh, these are the things you can only do in Stata. I would say, generally speaking, for most statistical analysis, if you can do it in Stata, you can probably do it in R. There might be slightly different techniques. But again, Stata is not a programming language per se. So one of the distinguishing factors when you get back to R and the tidyverse is, as I mentioned, that everything I'm doing here I'm doing in R. I wrote a script that ingested the data, created these graphics, and then created the slide deck. That's something that I think would be harder to do in Stata: a kind of literate programming. If you want to know more about that reproducibility aspect, you can look back at some of the videos from part one. And as long as I brought that up, let me see if I can get out of this. I just want to point out that I have this website. I'm going to copy this URL and put it in the chat. Sorry, my toolbar is being slow. Alt-H. This is kind of a sub-brand of my center, the Center for Data and Visualization Sciences.
And it's mostly the links to all the shareable resources for the R workshops that we're doing. I'm focusing right here on this quick start for R. Last week we did mostly these two videos, which are more bite-sized. They move kind of quickly, but you can rewind them and play them at whatever speed you want. This week we're doing mostly these two videos. There's also a whole host of other videos on subtopics that I'll just bring to your attention. So if I cover merging too quickly or we don't get to it, or if I cover pivoting too quickly or we don't get to it, there are ways to dig into the material. Speaking of which, while I'm talking, let's see, I want to find this link and send it to you. This is one more little mini survey that I'm going to send you, because we have a lot to cover today and I want to cover it all. What I want to know from you is: what are the two most important things that you really want to make sure you get out of today's workshop? I will push those to the top. A lot of people said that they felt really comfortable with importing data and reasonably comfortable with editing scripts. Again, I'm going to take a tidyverse approach, and we're going to be editing R Markdown scripts. If those things are not clear to you, you can go back and look at the videos. What we're going to try to do is really focus on ggplot, interactive plots, pivoting, and joining. I'm making a distinction here between "make plots" and "ggplot2." What I'm trying to say there, whether or not I captured it in the question, is that you can make graphics in R with many different functions, many of which I might call base R graphics, "base R" being the nomenclature that refers to old-school R. I don't mean to be a fire-and-brimstone evangelist about R. If you're comfortable with base R, that's fine.
And if you want to stick with your base R graphics, that's also fine, as long as you can do what you want. Again, my approach is more of a tidyverse approach. I think it's modernized, and it lends itself to more of that full-featured reproducibility. That's where we get into ggplot2, which is the very first package of what eventually became known as the tidyverse, a suite of packages that extends R into a more modern data science approach. The packages all work well together. They're sort of symbiotic. They have excellent documentation that's all online. That's why I'm going that way. ggplot does things like automatically draw legends and all those kinds of things. I have a little sub-module today where we're going to talk about interactivity, assuming it ranks high enough. It's surprisingly easy to take a ggplot image and make it interactive. We'll talk about pivoting and joining data because those are parts of data wrangling, which we mostly covered in part one. But they're a little bit advanced. It's not the straight-up subsetting by column, subsetting by row, or making new variables. And they're important because ggplot prefers what we would call a tall data format as opposed to a wide data format. Pivoting in particular helps you shift your data around in ways that are natural for ggplot to iterate over row by row. So we'll talk more about that. I always like to ask this question. Again, this is another kind of sub-question about reproducibility. But what is your... oh, actually, sorry, different question. What is your current visualization tool? Are you satisfied with your visualization tool workflow? What I am getting at there is, when you set out to make plots, does it feel like it's flowing in your overall workflow?
Or is it complicated by the fact that you do some stuff here, like cleaning data in Excel, and then you're visualizing in Stata, and then you're outputting to a PNG, and then you're importing into an Adobe product and tweaking it, and then you're outputting it again and pasting it into Microsoft Word? The comment I would like to make is that, if that's your workflow, all of those different steps are break points or barriers to reproducibility, because they're really hard to document: oh, I copied this, I pasted it here, I moved it into that other directory. Whereas if you're doing it in a programming language, you can write a script that manages all of those aspects. The only point of this second slide, this darker slide, is to show off or model a little bit of what ggplot can do with one very simple function. I changed the theme of the slide from a stark white to a darker format. I didn't have to think about the colors or anything like that. It just made the change by itself. So it's nice to know about themes. A very quick comment about this Microsoft Teams site. RStudio in particular, but the R community in general, is known for its community-oriented help. You'll get very used to typing things into Google like "how do I make a bar plot in R," using that magic phrase "in R." A lot of times you'll get links to a site called Stack Exchange, which I always think of as a 50-50 site. About half the time it's really helpful, and half the time it really confuses me. Anyway, leading back towards R being a helpful community, there is a companion or alternative to Stack Exchange that's specifically about R. It's very helpful. You can Google the phrase "RStudio Community" and you'll find lots of people online there who can help you solve problems. I can try to help you solve problems. And in addition to that, I created this Microsoft team.
It's just for the Duke University community, including the Med Center. You have to have a NetID. If you're in Teams, or if you're in Microsoft Outlook, which I think we all use for mail, you can click on this Teams button, then click on this "Join or Create" button, and then put in this code. Or you can click on that link. As long as you have a NetID, you should be able to use it. It's a fledgling idea that I started last semester, and I got a little bit of traction. The idea is that you can post questions there, or you can monitor it to lend your own expertise, because I don't know everything there is to know about R, and there are lots of people with knowledge of different aspects of R. So this could turn out to be a helpful local community for Duke that's not broadly accessible. One caveat that I will mention, though I don't think I need to: anytime you're posting questions with data, if it has personally identifiable information, of course you want to scrub that information out. You can take a picture of this if you want, but again, this slide deck is in one of the links I've sent you, and it's actually buried inside this GitHub repository. This is just to say that my center exists to help people with mapping and GIS, and with visualization. We help with Stata, and we help with SAS if we can. Our expertise is more with Stata, R, and Python, and Tableau for visualization. We have different experts in all of those areas. We have two experts who specialize in data management, so everything from writing the data management tool that goes into your grants to finally ingesting, or finding the proper repository for, your end-result data and outputs. You can get in touch with us through this email or this website, and you can get in touch with me directly. I'm happy to consult with you. If you just want to schedule time with me because you're confused about something, I'm happy to meet with you.
For everything we're going to do today, you should already have seen that URL; the code is going to come from there. This is a new URL. We'll do some exercises out of there today. And a last point (I think I need to move this slide): when we talk about asking questions, I want to at least give you one tip. The R community has come up with this concept. I don't think it's a standard or convention yet, but it seems to work well. It's about making your question a reproducible example, or a reprex. The idea is that when you ask a question, you reduce it down to the smallest set of code and data needed to reproduce the problem. You can follow this link if you want to read more about it. There's plenty of good information. I'm going to contrast that with what you'll sometimes see on, for example, Stack Exchange or anywhere else: someone will send 500 lines of code and say, "I've got a problem in one of the subroutines. I'm not sure what's wrong." It's really hard to answer a question like that. It's much better if you can reduce it down. Just get a subset of example data and say, "Every time I run this function, I get this result. I don't know why." If you can reduce it down, you're much more likely to get a quicker answer. If you're brand new to this and you don't know how to do that, I am sympathetic. Reach out to me. I'll do what I can to help you. And then this last slide, and then we'll be done and I'll open up the floor. This is really just one more example. It consists of a panel on the left with pre-survey results and a panel on the right that has fake data in it, actually duplicate data standing in for the post-survey, for which I will hopefully send out a link before the end of today. So the post-survey panel you can ignore, but this is another example of what you can do with a few lines of ggplot code.
So I've got pre-survey results based on the various things I hope we'll cover (and we'll look at your survey results here in a second): visualization, ggplot2, interactive plots, join, pivot. This helps me understand where people are coming into the workshop. And if you'll fill out the post-survey, it'll help me assess whether or not I got close to delivering on the stuff that I was hoping to teach. So that said, I want to take a quick look at the survey that I just sent out, wherever it is. I'm looking at my Google Drive here. Oh, here it is. I will try to display that on the screen so you can see it. Oh, goodness. I just did it again. I do this every time, but no big deal. I need to un... Oh, I guess I un-paused the recording, or maybe Drew did. Good. Let's see. Could I ask a question, John? Sorry. Yes, please do. Please do. Where is the PowerPoint you're referring to? Ah, yes. That's a great question. It is actually buried. And I will also say that not only is it buried, but I'm still writing the scripts that make those slides, so they're not as reproducible as I would like them to be. But I put them in this GitHub repository, which I just put a link to in the chat. Oh, no. That's the wrong link. Sorry. This link, from the flipped repository. I'll put that link in the chat, and I'll drill down into it. There's a folder called Slides. And there is a... Let's see. Here it is. The one we're doing today is this one: 2021.0204 flipped part two, vizslides.html. So you should be able to download that. The way I usually do that, if you're not used to looking at GitHub repositories, is that if you go to that URL, the last URL I put in the chat, you can click on the green Code button and download a zip of all of the files, and then unpack or expand it on your local computer.
We covered some of that last time in part one, so I won't cover it now, but if you have trouble after the workshop, let me know and I'll be glad to help. So now I want to share the results of the survey. Most people want to see ggplot, followed by regression, so we'll do that next, then we'll do interactive, and follow that with pivot and then with join. Okay, that's a great plan. One caveat I'll tell you about regression, and I probably should have said this first: what I'm really going to show you is some of the tidyverse tools that allow you to manipulate regression model outputs. This is not a stats class, we don't have time for a stats class, and regression and models in general are really big topics. I'm not going to get into how you interpret the p-value or how you decide which kind of test to do. But I will show you some of the existing tidyverse tools that help you deal with regression models so that you can go forward with your analysis. And I should mention that we do have two grad students who work in our lab and who are available on the chat. Let me go back and share this other link. I can find it. Here we go. One more. This is my departmental website. You can use slash data. I'll put that in the chat. What you'll see on that site is a chat button and the hours when people are available. Annabelle and Hanleng Su are our two grad students, master's candidates working in computational economics. We don't, per se, offer specific research design help on choosing your best statistical model. We're not really a statistical consulting operation. But we'll help as much as we can and then refer you on if it becomes too discipline specific. Hanleng and Annabelle are our statistical experts, and they only work on chat. The rest of us will take appointments, which you can make through AskData.
So if that is your particular need, please keep us in mind, and you can reach out to them directly when they're online. All right. So, starting off with ggplot. What I would like to do is try to open the floor. In the videos that you would have watched, we eventually recommended doing these exercises right here, exercise 02 viz. And just so you know, there are the answers to exercise 02 viz. I don't care which one you choose. You could open up the one with the answers and you're still going to learn something, which is my whole goal: that you get engaged with the information. So if you have questions about those in particular, I would like you to go ahead and bring them to our attention right now. You can unmute and just jump in at any point. If you don't have questions about those, I'll wait, and then I will try to do some different exercises that you haven't seen yet. So it'll be new for the people who did the prep work, and it won't be out of reach for those who are just joining us without the prep work. Let me pause for a minute and give people a chance to unmute and ask questions. It looks like Slottie has a question. You need to unmute, Slottie. Yeah, there you go. Yes. So I was attempting one of the exercises, and I'm sorry, I didn't notice the answers file, so I might have looked there. But if I can just ask directly: when I was trying to create the histogram with x equal to hwy, I couldn't use the binwidth argument. I don't know why, it just didn't work for me. Okay, so I'm going to go to exercise two, and I'm going to click Run All, so the results for the answers are all inline in my script. And I'm going to scroll down to the question that you were asking about, which I think was this one. Yes. Right? Yeah. Okay, or actually this one. So you can see that the answer is right there.
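
For readers following along without the screen, here is a minimal sketch of the kind of histogram being discussed. The exact exercise answer is not reproduced in the transcript, so the bin width of 2 and the object name `hwy_hist` are assumptions for illustration; `mpg` and `hwy` come from the ggplot2 package.

```r
library(ggplot2)

# Histogram of highway mileage from the mpg data that ships with ggplot2.
# binwidth is an argument of geom_histogram() and expects a number,
# so it is written as 2 rather than "2".
# The value 2 is an assumed bin width, chosen only for illustration.
hwy_hist <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2)

hwy_hist
```

Every row of `mpg` falls into exactly one bin, so the bar heights add up to the number of rows in the data.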
Sometimes the problems may be syntactical, like you don't have a comma in the right place, which would be my guess. So I was putting the numeric value two in quotes, but I didn't have to. Yeah. Correct. Correct. That's kind of an R standard: when you're doing an argument assignment like this, whether it's a single equals or even a double equals, if the value is numeric, you don't put it in quotes. But if it's character based, you usually do. Got it. Thank you. So I'll take a second here to see if there are other questions. If not, I will jump into the other exercises that I have. Let's see, how am I going to do that? All right, I'm going to do it kind of an old-school way, just in case you're trying to do this along with me, which I think is kind of difficult with R. I should mention that I'm used to people coming to workshops in person. I would give them exercises and walk around the room, and it was a great way to personally answer little bits that I might not have been clear about, or that are confusing to a new person. It's a lot harder to do that in a Zoom context, which is why I quote-unquote flipped the workshop and tried to make a bunch of videos that you can watch at any time. They should cover some of your questions, hopefully even if I don't cover them today. But it is a little harder to work as a group. So I will try my best to show you what I'm doing. I'm going to go to this repository. It should be in your chat box. It's the one at github.com/libjohn/intro2r_exercises, where the "2" is the numeral two. There are much easier ways to do this, but the foolproof, always-works way is to click on the green button and download this repository, which I uploaded to GitHub using RStudio. And I created it in RStudio as an RStudio project.
And I did that because making it an RStudio project makes it easier for you all to run on your computers. I was using some simple reproducibility techniques, like relative file paths, which means you don't have to set your working directory to whatever I happened to use. If none of that makes any sense to you, you can ignore it or go back and watch the part one video. I'm going to put this expanded, unzipped repository on my desktop. And before I completely finish, I just want to show you that one more time. I downloaded it. I'm on Windows, so I right-clicked it. It's really important to expand, or Extract All. If you don't extract all, I think you'll be able to work for a little while and then you'll end up with some problems. So even if you're on a Mac, make sure that you expand the zipped file. I put it on my desktop; you can put it anywhere you want. By the way, if you're not seeing, at this moment, a blank blue Windows screen, let me know. But I'm assuming I'm in the right place. What I just did is open up the folder that I expanded onto my desktop, and I'm going to double-click on this R project file. That will launch this repository as a contained project in RStudio. There's my project name right there, and if I click on that, I can flip to other projects. I'm going to start right here: 02a viz answers. Let's see if I have an 02a that doesn't have answers. I do, 02a viz. I'm going to start here and see if I can answer my own questions. All right, so this puts me into an R Markdown script, which has Markdown prose in between the code chunks. The code chunks are these little gray parts. I'm going to make some quick changes to my RStudio so that hopefully it's a little easier for you to see: Global Options, Appearance, 150%. You do not have to do this. Okay. Oh, I don't want to restart R, but I will if I have to. Okay.
Oh, it restarted for me. How nice. The other thing I'm going to do, in the editor view, is press Ctrl+Shift+1, because one of the advantages of working in an R Markdown script is that you don't really need all those other little quadrants. There's some metadata at the top. I commented out this particular line because, in this context, I'm not all that concerned with creating any kind of derivative output, a report, but I could make a Word file or a PDF file or whatever. So the very first thing I'm going to do is load the tidyverse library, which consists of... let's see if it'll tell me. Hopefully I'll get some feedback. Yeah, I got some feedback. It consists of ggplot and a whole bunch of other tidyverse libraries that all work well together. We talked about some of that in part one. One of the nice things about ggplot is that it comes with some onboard data that we're going to use in these exercises. Okay, so goal one is just to make a standard scatterplot, starting with the mpg dataset. Let's take a quick look at the mpg dataset. I'm going to highlight that and press Ctrl+Enter, and I'm looking at the data. If I were back in the four-quadrant view, I could go to Help, type mpg in here, and get a kind of code book for the data. But sometimes you can just look at it and figure out what you want to know. It's 234 rows and 11 columns. The first column is called manufacturer. It's a character column, and the first 10 rows all have the same category, Audi. This is information about the miles per gallon of cars, specifically their city and highway mileage. It has some other information about the cars: the make, the model, the displacement. I'm not really that big into cars, but just in case you're not into cars either, displacement is a measurement of the size of the cylinder. The cylinder is what the fuel goes into, and the bigger the cylinder, the more fuel you burn.
So it is a factor in determining miles per gallon. If you're burning a lot of fuel, you may be getting lower miles per gallon. So our question is: make a scatterplot using displacement as the predictor variable on the x-axis and highway (that's highway, that's displacement) as the response variable. The simple way to do that... I'm going to put a comment right there, because I don't want to get down there yet. I just want to do this part. I'm going to start by saying what my data frame is. And then comes a tidyverse convention: this is a conjunction, the pipe. It's a way of stringing together multiple functions as if they were a sentence. You may have heard me say a moment ago that ggplot is the very first tidyverse package, and it uses a different conjunction. Eventually they standardized on this one as the conjunction, but initially ggplot used a plus sign. You can mnemonically think of both of them as saying "and then." So if I were reading this data sentence, I would say: take the miles per gallon data frame and then send it to ggplot. Now, the standard syntax for ggplot is as follows: ggplot, data equals a data frame, a df, and then aesthetics, where x equals variable one and y equals variable two. That sets up the infrastructure so that you're ready to visualize something. Then you use the conjunction, "and then," and you add layers, which is what we're going to do. We're going to make a scatter plot, so that would be geom_point. And essentially that's what you need to do. That's the formal syntax for the grammar of graphics, gg standing for grammar of graphics. In practice, now that the tidyverse has grown up a little bit, we write it a little differently. I don't need this part formally because I'm piping it in right here. I'm piping the data frame into ggplot. So mpg is assumed.
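
The two ways of writing the call described above can be sketched as follows. This is a reconstruction rather than a copy of the on-screen script: it assumes the pipe being used is magrittr's `%>%` (which is also available after `library(tidyverse)`), and the object names `p_formal` and `p_piped` are made up for illustration.

```r
library(ggplot2)
library(magrittr)  # assumed source of the %>% pipe

# Formal syntax: name the data frame and spell out the x and y aesthetics,
# then add the point layer with +.
p_formal <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

# Piped shorthand: the data frame is piped in, so data = mpg is implied.
p_piped <- mpg %>%
  ggplot(aes(x = displ, y = hwy)) +
  geom_point()

p_piped
```

Both objects describe the same scatter plot; the only difference is where the data frame is named.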
And I don't have to say x equals or y equals, because it's assumed that the first position is going to be x and the second position is going to be y. If you feel more comfortable writing them, there's nothing wrong with that. So my question is: the x-axis is displacement, displ, and the y-axis is highway. If I stop there and run that code... of course, it helps if you spell it properly. I missed an L in my displacement. What ggplot has done for me is draw a blank canvas that takes into account the data ranges of displacement and highway. So displacement has values that go from, presumably, 0 to 7, or maybe it's 1 to 6. And the y-axis, highway, has values going from roughly 0 to 50. That's all that happens. Of course, no one stops there. Now I want to add a layer, and you can add multiple layers. The layer I'm going to add is geom_point. Notice that when I started typing geom, I got a context menu with all of the ggplot geometric functions, so I can make all kinds of different plots. I want to use geom_point because that's the one for a scatter plot. Now, before I go much further, here's a good time to pause and switch back to my web browser, which I hope you are seeing. In a tab, I'm going to type in ggplot2.tidyverse.org. I'll put that in the chat. This technique works pretty well for all the tidyverse libraries. So if you have a question about a dplyr function, you can go to dplyr.tidyverse.org. The site starts with a cheat sheet that's really handy to download, and with some information that you can read through to get started. The other thing that's really nice here is that it has a reference list for every function. I just mentioned that we're going to use the geom_point layer. If you click on Layers over here in the submenu, you can get more information about all of these layers and how you use them. There's geom_point right there.
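
Back in the script, the two steps described above (the blank canvas, then the point layer) can be sketched like this. It is an illustration under assumptions, not the on-screen code: the `%>%` pipe from magrittr is assumed, and the names `canvas` and `scatter` are hypothetical.

```r
library(ggplot2)
library(magrittr)  # assumed source of the %>% pipe

# Step 1: aesthetics only. With no geom layer, ggplot2 draws just the axes,
# scaled to the ranges of displ and hwy (the "blank canvas").
canvas <- mpg %>%
  ggplot(aes(displ, hwy))  # first position is x, second position is y

# Step 2: add a layer with +. geom_point() draws the scatter plot.
scatter <- canvas +
  geom_point()

scatter
```

Printing `canvas` on its own shows the empty axes; printing `scatter` shows the same axes with one point per row of `mpg`.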
You can also get this documentation onboard in RStudio, but I find it easier to look at online. All right. So remember, when we wrote just this much of the code, we got a blank canvas. Now I'm going to put in a layer to visualize it as a scatter plot. I'm going to highlight those three lines and press control-enter. I could press the green button, but then it's going to try to run these other two bits of code and give me an error. And there I have my scatter plot. So let me at least comment those two out. Yeah, that's the basic scatter plot. Okay. Let me move this up a little higher so it's not in the way. And there we go: make a scatter plot. That's pretty much the simplest plot you can make in ggplot2. But the nice thing is, once you understand that you use the aesthetics argument to map variables of a data frame, and that you can use the various geoms to generate different and multiple layers, you have an idea of how to start constructing a more complex visualization. So the second question here is: add color to each point by the type of car. I'm giving a hint here that the type of car is in the variable class. Let's go back and look at mpg and look at class. I'm going to scroll to the right. There's class. If I scroll through that data, I can see that I've got information for minivans and midsize cars and SUVs. So I want my plot to have different colors based on the type of car. I use the aesthetic argument up here, and that's a global aesthetic argument. I can also use aesthetic arguments specific to a layer. So I'm going to use both. And the argument that I'm going to leverage here is the color argument. Let me go back for a second to my documentation. Every geom has a series of aesthetic arguments that you can set.
Sometimes they're a little different depending on the function. But color is one that I'm going to use. You may notice that colour here is spelled with a U. The original author of ggplot2 and much of the tidyverse is from New Zealand, and they use the British spelling there. But they were very nice to people like me, and they allow you to use either the British or the American spelling. So I type in color equals, and here I'm going to put in class. When I run that, I get my graph, where each dot is automatically colored and the legend is automatically written. These are the categories of the class variable. I mentioned that you can have aesthetics either local to a layer or global. Global means that if I put in another layer right here, for example geom_smooth, then x and y are available to both layers. Let's undo that for a second. I want to point out that I could also do it this way because we only have one layer: I could have three arguments in the global aesthetic, my x argument, my y argument, and my color argument. It's going to produce exactly the same looking visualization. Whether you make an aesthetic global or local depends on your overall goal. But if you can do it both ways, maybe your decision is to use global when it allows you to type less.
John, may I interrupt with a question here?
Absolutely.
So I've had issues in the past when I'm looking at, say, the top 30 microbes that show up in my samples. Between my visualizations, I can't seem to keep the same colors for the same organism, like Staphylococcus. I need all of those to be purple or something. I've used ColorBrewer. I've tried to designate a specific color for a specific organism, but I've been unsuccessful in getting consistent colors. And for readers, if I'm trying to publish a paper, it tends to be helpful to keep that consistent.
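The local versus global choice described above can be sketched as follows. This is an illustration under the assumption that the mpg data and the %>% pipe are loaded as in the session; the object names are made up for the example.

```r
library(dplyr)
library(ggplot2)

# Local: color is mapped inside the point layer only
p_local <- mpg %>%
  ggplot(aes(displ, hwy)) +
  geom_point(aes(color = class))

# Global: color is mapped in ggplot(), so every layer inherits it
p_global <- mpg %>%
  ggplot(aes(displ, hwy, color = class)) +
  geom_point()

# With a single layer the two plots look the same. The difference shows
# up once a second layer is added: the local version gets one smooth
# line overall, the global version gets one smooth line per class.
p_local_smooth  <- p_local  + geom_smooth()
p_global_smooth <- p_global + geom_smooth()
```

So "use global when it lets you type less" is a reasonable default, with the caveat that a global mapping also changes how later layers behave.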
I understand the question, and it's a good one. I would have told you, although you added more information and now I'm not sure, that you should be able to use one of the manual scale functions, scale_color_manual or scale_fill_manual, to assign colors to particular categories. I'm not sure why it's not working if you have done that. Maybe there's a possibility that during today's workshop we could look at your data and try to troubleshoot it. If not, we can work that out in a consultation.
Right.
I think I hear two challenges. One is that it sounds like your categories may change based on the frequency of their values. One of the tidyverse libraries is called forcats, and you should still be able to use forcats to hard code the categories in a particular order, tied to particular colors. Syntactically that could get a little confusing. And you said that it was switching up on you, which gives me a little pause. Without seeing your data, I'm not sure what's going wrong, but I'm happy to work on those details with you. All right. So what have I got here? I see I've got a little error, and I'm not sure why. I've got a little red squiggly underneath my close quote. I'm not sure that's even relevant. I'm going to try this one. Yeah, that's a mistake. Anyway, the code ran, so I created color. By the way, I don't know if I can override this; we'll see. You can assign colors manually, so I can type in the word red. Let's see... yeah, it overrode. It came second, so it overrode the dynamic assignment. It's good to know that putting the color argument inside the aesthetic function or outside the aesthetic function gives you different outcomes. Okay. Lastly, add a regression line.
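One way to keep a category on the same color across several plots is a named color vector passed to a manual scale. This is a sketch, not a diagnosis of the attendee's problem: it uses the mpg classes as a stand-in for organisms, and the particular colors and object names are arbitrary choices for illustration.

```r
library(dplyr)
library(ggplot2)

# A named vector ties each category to one color, whatever order or
# subset of categories a given plot happens to contain
class_colors <- c(
  "2seater"    = "purple",
  "compact"    = "steelblue",
  "midsize"    = "darkorange",
  "minivan"    = "forestgreen",
  "pickup"     = "firebrick",
  "subcompact" = "goldenrod",
  "suv"        = "grey40"
)

p_all <- mpg %>%
  ggplot(aes(displ, hwy, color = class)) +
  geom_point() +
  scale_color_manual(values = class_colors)

# A plot of a subset reuses the same vector, so "2seater" stays purple
p_subset <- mpg %>%
  filter(class %in% c("2seater", "suv")) %>%
  ggplot(aes(displ, hwy, color = class)) +
  geom_point() +
  scale_color_manual(values = class_colors)

# A constant color set outside aes() applies to every point
p_red <- mpg %>%
  ggplot(aes(displ, hwy)) +
  geom_point(color = "red")
```

Because the colors are matched by name rather than by position, dropping or reordering categories between figures should not reshuffle them. For bars or other filled shapes, the equivalent call is scale_fill_manual with the same named vector.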
So I'm going to put in one more conjunction, and I'm going to type geom. The regression line function is called geom_smooth, S-M-O-O-T-H. Notice, by the way, that a lot of the arguments have popped up here for my convenience, but I could always go to the online or onboard documentation. I'm going to run this without any arguments and see what happens. What it did is draw a loess smooth regression line with a standard error confidence band. And I just want to show you that there are some other things you can do with geom_smooth. I could start by saying I don't want my standard error to show up. If I rerun that, the standard error is gone, just by saying se equals FALSE. I could also change the method to a linear regression rather than loess: method equals lm. Normally you would put that in double quotes, but I think so many people don't that they probably wrote the code to take either format. That gives me a straight regression line. And I'm not going to pretend that this is good statistics, but I want you to see that you can do stuff like this: I can draw different regressions for the different categories, 2seater, compact, midsize, just by saying group equals class, which is relatively the same as what you see right there. I'm going to take out that open-close parenthesis, and I'll run that. And now I have a bunch of different regression lines. Maybe I'm not sure which is which, so I'm going to do one more thing here. I'm going to say color equals class, and then it's really clear which regression line is for which category. At this point this is a really ugly visualization. But one of the things it tells me is that, generally speaking, larger displacement ends up with lower gas mileage. And the one category that's a holdout, where that's not really true, is this one, the two-seater cars.
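The geom_smooth variations walked through above can be sketched like this, again assuming the mpg data. The object names are illustrative only.

```r
library(dplyr)
library(ggplot2)

base_plot <- mpg %>%
  ggplot(aes(displ, hwy)) +
  geom_point(aes(color = class))

# No arguments: for a data set this small the default is a loess
# smoother drawn with a standard-error ribbon
p_loess <- base_plot + geom_smooth()

# Hide the standard-error ribbon
p_no_se <- base_plot + geom_smooth(se = FALSE)

# Straight-line fit; method = lm (unquoted) is also accepted
p_lm <- base_plot + geom_smooth(method = "lm", se = FALSE)

# One regression line per class; group splits the fits,
# color makes it clear which line belongs to which class
p_by_class <- base_plot +
  geom_smooth(aes(group = class, color = class), method = "lm", se = FALSE)
```

As the speaker notes, the per-class version is more of an exploratory device than sound statistics, since some classes (2seater in particular) have very few points.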
They seem to have pretty steady gas mileage; there doesn't seem to be a whole lot of difference. There's not a lot of data there, so I don't know how reliable that is. But one question we can ask is: why would two-seater cars with really large engines have relatively higher gas mileage than many of the others? Probably because of the overall weight of the car. Two-seater cars are pretty small. Anyway, that's a really quick visualization that demonstrates not only some ggplot capabilities but also some loose exploratory data analysis. Happy to take questions as we move on, and there's another example with all the code. Let's move on to bar plots, and you'll see that they work similarly. For bar plots, I want to point out that there are actually two kinds of functions: geom_bar and geom_col. Here's geom_bar. Here's geom_col. The difference is that geom_bar sets the height of the bar by counting the number of observations in the data frame, whereas geom_col gets the height of the bar from a value in the data. I'll try to make both of those clear. So we start with the midwest data, which is some onboard population data, part of the ggplot2 package, about some states in the Midwest of the United States. And the question is: display a bar plot for each category showing the categorical frequency. Now, I should have put category in backticks, because that is an awkward sentence. But if I scroll all the way over to the right, there is a variable called category, and there are these values. I'll admit off the bat that I have no idea what these values stand for, and I couldn't find it in the documentation, but it doesn't matter. It helps just to know that these are categories, like in the previous graph where you had 2seater and minivan and so on.
They're categories that represent something. They're discrete categories, and there aren't very many of them. So I want to make a bar plot and know how many of each of these categories there are. I start with the midwest data frame, I pipe that, "and then," into ggplot. Then I open up the geom_bar function. In the aesthetics argument, I only need to put in category, this variable right here. When I execute that, I get a bar plot that tells me the frequency of each one of those categories. The nice thing about ggplot, of course, is that it drew the x-axis, drew the y-axis, and labeled x and y. We can modify those, but it did all that for us. All right, part B: arrange the bars in order from most frequent to least. This demonstrates a different tidyverse function, one that will come up a lot, at least when you're making bar plots or similar visualizations. As a general visualization principle, you would want this to be organized not alphabetically but from most frequent to least frequent. It depends on your audience, but that's a general approach. The way I'm going to do that is with this library. You remember, when we scrolled all the way back up to the top, I said that the tidyverse libraries work well together. So we're using ggplot2, and we're also going to use forcats. forcats is the library that helps you manipulate factors. The functions I want start with fct, because in the R language a category is a factor, and I'm going to use the function fct_infreq. Why is it not showing up? What am I doing wrong here? forcats, F-C-T... I might have spelled it wrong. Okay. What that does is organize the bars by frequency. And then the third question is: make this a stacked bar chart, stacked by state.
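A minimal sketch of the frequency bar plot and its reordered version, assuming the midwest data set from ggplot2 and the forcats package. The object names are illustrative only.

```r
library(dplyr)
library(ggplot2)
library(forcats)

# geom_bar counts the rows in each category; bars appear alphabetically
p_counts <- midwest %>%
  ggplot(aes(category)) +
  geom_bar()

# fct_infreq reorders the factor levels from most to least frequent,
# so the bars are drawn tallest first
p_ordered <- midwest %>%
  ggplot(aes(fct_infreq(category))) +
  geom_bar()

# The reordered levels can be inspected directly
levels(fct_infreq(midwest$category))
```

Wrapping the variable inside aes() leaves the underlying data frame untouched; only the plotting order changes.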
And you would have no way of knowing this, but a stacked bar chart in this case assumes that there's going to be a color factor involved. So inside of the aes function, but outside of fct_infreq, I'm going to say color equals state, the state variable. Oh, I do this quite frequently, and you will too. In this case, color refers to the border around the bar. I don't want color; I want fill. And that will create my stacked bar chart. I think that's probably in order... no, maybe not. If I needed to force the order, I could do that as well, also with a forcats function. I'm afraid I'm going to mess that one up if I do it on the fly, so I'm not going to do it. But we have another example soon. That was geom_bar. And I think this is... yeah, that's the same thing. So here's geom_col. Remember that I said the difference is that geom_bar does its own calculation, counting up each observation, each row, in the data set. But what if I wanted to use a value that was already in the data set? Here I'm going to have to rely on some dplyr functions that we learned last week, so I'm just going to give them to you. I want to alter the data set so that I have not the population for each county but the population totals for each state. I'm going to do that by first grouping by state. That's here. Then I'm going to summarize: I'm going to create a new variable called state_pop, and I'm going to sum poptotal. And if I run those together... oh, what did I do wrong? All right. So silly. I put in the value of the variable, not the variable name. I put in IL instead of state. Those of you who are watching from home, probably some of you knew that instantly. It's kind of hard to talk and think at the same time sometimes. But what I've got now is a different data frame: a modified, transformed, wrangled data frame. Everyone say it.
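The stacked bar chart and the state-level summary described above might look like the following sketch, assuming the midwest data and dplyr. The name state_totals is made up for the example; state_pop follows the session.

```r
library(dplyr)
library(ggplot2)
library(forcats)

# Stacked bars: fill (not color) shades the inside of each bar by state
p_stacked <- midwest %>%
  ggplot(aes(fct_infreq(category), fill = state)) +
  geom_bar()

# Collapse the county-level rows to one population total per state
state_totals <- midwest %>%
  group_by(state) %>%
  summarise(state_pop = sum(poptotal))

# geom_col takes the bar height from a value already in the data,
# here the state_pop column
p_state_pop <- state_totals %>%
  ggplot(aes(state, state_pop)) +
  geom_col()
```

Note that group_by takes the variable name (state), not one of its values such as IL, which is the slip the speaker catches while typing.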
It has what I want to visualize, right? Instead of being 400 rows or whatever it was before, now it's five rows with a population number for each state. And that's what I want to send to geom_col, because these are the numbers that set the height of the bar. So my x-axis is going to be state and my y-axis is going to be state_pop. Then I can just send that to geom_col. And that's the difference between geom_col and geom_bar. All right. Let's do geom_line. Let me check my time. Yeah, I think I'm okay. I think I can make it through all this. geom_line. Okay. One of the things that geom_line is going to help me introduce is something I think I said before: ggplot likes the data to be long. We're going to come back to this when we talk about pivoting. Among the onboard data sets that come with ggplot2, there's one called economics and one called economics_long. If I look at both of those, here's economics. It's a wide data frame, six columns wide, 574 rows. And economics_long is 2,870 rows, but only four columns. The difference is that it turned all of these values into rows, and all of these values into rows, and all of these values into rows. ggplot really likes that. It makes it good for iteration. So let's comment out economics and just work with economics_long. We'll come back to discussing that data format in a bit. The task: draw a time series line plot of the various values in the tibble. Tibble is another name for a data frame, specific to the tidyverse. Use the group argument. Okay. So the x-axis on a time series is going to be the date, and the y-axis is going to be value, what we see here in the third column. Let's just stop there so you can see what happens. I find these line plots hard to draw. It'll look goofy, because it doesn't really understand the groups yet, right?
The groups were pce, pop, psavert, and so on. As we looked at it, of course they're in rows, but there's pce. And if we scroll through this a bunch, we'll eventually get to pop. I'll go down to the bottom. There's pop. So I want those groups to be relevant. And I can do that by saying group equals... somebody's mic is hot. Thank you. Group equals value. No, I'm sorry, group equals variable, so that I can effectively subset each line by the value of this column named variable. When I run that, now I've got what looks like four or five lines there. I don't know which one is which, but I got that because I used the group option. I can augment that by also using the color option. And as it turns out, it's a little bit redundant: if I just use the color option by itself, I get this more understandable plot. So at this point, group is a little redundant, but it doesn't hurt anything. All right, so that's making a line plot. It's a time series plot. It drew the time axis from a date variable type, and it gave me my y-axis labels in scientific notation. I want to fix that because I don't really like the scientific notation. So I'm going to introduce this concept called scales. Scales are usually how you affect color or other aspects of the plot. I type scale underscore y, and there are whole bunches of scale functions; I want continuous, because it's a continuous variable. So scale_y_continuous. It has an argument called labels. I'm not seeing it off the bat, but it's in the full documentation. I know you wouldn't know this off the bat; I'm just showing you because you'll run into these things and you'll want to do them. Here I'm going to give the labels argument a function from a different library. There's a library called scales, and I'm going to use its comma function. You can watch how the y-axis changes when I run that. I now have readable scales.
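A sketch of the line plot built up in this passage, assuming the economics_long data from ggplot2 and the scales package installed. The object names are illustrative only.

```r
library(dplyr)
library(ggplot2)

# Without group or color, geom_line joins every row into one zig-zag line
p_goofy <- economics_long %>%
  ggplot(aes(date, value)) +
  geom_line()

# group = variable draws one line per series; color = variable does the
# same and adds a legend, so group becomes redundant but harmless
p_lines <- economics_long %>%
  ggplot(aes(date, value)) +
  geom_line(aes(group = variable, color = variable))

# Replace scientific notation on the y-axis with comma-separated numbers
p_readable <- p_lines +
  scale_y_continuous(labels = scales::comma)
```

The labels argument takes a function, which is why scales::comma is passed without parentheses.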
Building on some of what I know about the tidyverse, I can answer part C, which is: eliminate the pop variable and show the remaining four variables as a facet wrap. So let's get that right. I want to eliminate all of the rows where variable equals pop. I'm going to do that with a dplyr function, filter, where variable does not equal pop, and then pipe that new, smaller data frame into ggplot. When I run that, I no longer have that really high population value. I'm only doing this so you can see how you can manipulate your graphs. The other nice thing is that I can use this thing called facet_wrap. There are two really useful commands, facet_wrap and facet_grid. facet_wrap is a little easier to deal with. What it will do is create subplots, one for each categorical value of the column named variable. The syntax for that uses the same variable name. When I run that, I now get four plots with the same scales, all put together. That's a really handy way to look at multiple pieces of data. In a case like this, from a visualization standpoint, I would put a little more effort into my labels. And I would probably eliminate my legend, because it doesn't really deliver anything extra. It's just more stuff to look at, so it adds a bit of what they call cognitive load without delivering any value.
Thank you. Actually, at this point, could you also tell us: on the y-axis right now, every other grid line is labeled. Is it possible to get every grid line labeled?
Sure. Yeah, you can control the breaks, and you can control every grid line. I find it a little confusing to do, so I'm not going to try to do it live, because I'll probably get it wrong and get stuck in a rabbit hole. But let's see. Breaks.
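The filter and facet steps can be sketched as follows, assuming economics_long and dplyr. The name no_pop and the legend-removal line are illustrative additions that follow the speaker's own suggestion about dropping the legend.

```r
library(dplyr)
library(ggplot2)

# Drop the population series, keeping the other four
no_pop <- economics_long %>%
  filter(variable != "pop")

# One panel per remaining series; by default all panels share the same
# axis ranges
p_facets <- no_pop %>%
  ggplot(aes(date, value, color = variable)) +
  geom_line() +
  scale_y_continuous(labels = scales::comma) +
  facet_wrap(~ variable) +
  theme(legend.position = "none")  # panel titles already name each series
```

Because the series are on very different scales, some panels will look almost flat with shared axes. facet_wrap also accepts scales = "free_y" to give each panel its own y-axis, at the cost of making the panels harder to compare directly.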
There's a whole bunch of break arguments listed there, and some of them are arguments within functions like the scale functions. What I'd like to do is skirt your question by saying: yes, it's totally possible. I don't have those arguments at the top of my head, but they're not hard to look up. If you get into that situation and you're stuck, please reach out to me. I can find the answers relatively quickly.
Thank you.
Totally possible. You can control all of that stuff. As a matter of fact, since you asked, I'm going to put this in here. It might be the very next thing I'm getting to, and if it's not, I should cover it anyway. There are other things that you want to do with your visualization, like make better labels. The function is labs, and there's a whole bunch of arguments that go into it: title equals, subtitle equals, x equals, y equals, and source, just to name a few. Let's go ahead and fill those out. No, it's not source; it's caption. Caption equals. So I could say state population, or state demographics. Subtitle: I don't know, stuff you should know. For the x-axis and y-axis, I'm going to eliminate both labels, because I don't think date is particularly explanatory or necessary here, nor do I think value is. But you could put something on the y-axis like "in centimeters," whatever is relevant. And then for caption, I'm going to put my source information: ggplot2, and what is this? economics_long. Of course, that could be much more verbose. When I do that, I've managed a bunch of other aspects of my graph that I would normally have to handle. So labs, short for labels, is a function that's really worth knowing. There are other ways to do this, but that's the one that consistently works and the one I tend to use all the time. All right.
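A sketch of the labs call described above, together with one possible answer to the grid-line question that was left open. The title and subtitle strings are placeholders rather than the speaker's exact text, and the choice of n.breaks and minor_breaks to address the labeling question is my assumption, not something shown in the session.

```r
library(dplyr)
library(ggplot2)

p_labeled <- economics_long %>%
  filter(variable != "pop") %>%
  ggplot(aes(date, value, color = variable)) +
  geom_line() +
  facet_wrap(~ variable) +
  labs(
    title    = "Placeholder title",
    subtitle = "Placeholder subtitle",
    x        = NULL,   # NULL removes the axis label entirely
    y        = NULL,
    caption  = "Source: ggplot2::economics_long"
  )

# By default ggplot2 draws unlabeled minor grid lines between the labeled
# major ones. Asking for more major breaks and switching the minor ones
# off leaves only labeled grid lines on the y-axis.
p_every_line_labeled <- p_labeled +
  scale_y_continuous(
    labels       = scales::comma,
    n.breaks     = 8,     # a request; ggplot2 picks "pretty" values near this count
    minor_breaks = NULL
  )
```

If specific tick positions are needed, the breaks argument of scale_y_continuous also accepts an explicit numeric vector.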
So that's a whole bunch about ggplot, and we have 45 minutes left. What I'd like to do is skip the interactive part for now and go straight to the regression material. Give me a second here. So where is my regression file? 02 underscore C, regression. I'm going to make that the sole screen again. For libraries, I'm actually using just dplyr and ggplot2 here. What I normally do is not call those libraries out specifically; I just call library tidyverse and pull in all eight symbiotic packages, because it's simpler. But I'm going to use these two: this one is for data wrangling, this one is for visualization. broom is the tidyverse package that helps you manipulate models, and moderndive is a package based on broom that is really used for teaching regression. So if you're not super clear on regression, you might want to track down the ModernDive book by Ismay and Kim; here's a link. It has really nice explanations and working examples. But if you're very comfortable with models, you might go straight to broom and ignore moderndive altogether, because it doesn't bring anything more than what broom does. So very quickly, a little bit about base R, old-school models. Let me execute this so that I have all the libraries loaded. This is the old-school way that somebody would write a linear model to do a linear regression. lm is the function. This would be the response variable, and that would be the predictor variable. So we're trying to predict mass from height. Then you identify the data frame. Typically people put the result into an object, so that's the assignment, and then just look at the object. For people who are used to models, this is what you'd normally see.
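The old-school workflow described here can be sketched as below. The transcript does not name the data frame, so the use of dplyr's starwars data (which has mass and height columns) is an assumption on my part, as is the object name fit.

```r
library(dplyr)  # only for the starwars data set (an assumed stand-in)

# Response on the left of ~, predictor on the right, then the data frame
fit <- lm(mass ~ height, data = starwars)

# Printing the object shows the call and the coefficients
fit

# summary() adds residual quartiles, R-squared, adjusted R-squared,
# and p-values
summary(fit)

# The fitted object is technically a list, so base R extraction means
# knowing where each value lives
is.list(fit)
summary(fit)$r.squared
quantile(residuals(fit))[["75%"]]   # third quartile of the residuals
```

Each value has to be pulled out with its own accessor or list index, which is the inconvenience that the tidy data frame approach is meant to remove.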
And this is fine; it's got the information that you need. Sometimes people will follow that up with the summary function on the same object. So let me run them all together. You can see my little separator there, which doesn't have any value other than to separate the two outputs. Summary gives me some more information about my model. My R-squared, my p-value, my adjusted R-squared, they're all there. But these outputs are a particular kind of R object, and I think it's probably technically a list. I personally find them hard to work with. So if you're an old-school R person and you know how to get the third... why am I blanking on that... the third quartile value of negative 17.73 out of there, you don't have to change. But one of the aspects of the tidyverse that I like is that it uses this mental model that everything is easier to work with if it's in a tidy grid, a data frame, what they call a tibble. And that's what broom really does. broom allows you to put all of these values into data frames. All right. So let's move forward and see how that works. In the moderndive library, there's a data set called evals, and I'm going to select four variables out of it. Apparently I'm going to take a look at both, but I really just want to look at this one with four variables. There's an ID for each observation, a score for each observation, a beauty score for each observation, and age. This is data, and I don't know if it's real or made up, but I think it's real, from instructors who are rated on, I guess, the effectiveness of their course. They're trying to figure out whether the age of the instructor and/or the beauty score of the instructor, which is a subjective variable, influences the score for the quality of the course. Okay.
So we can do some really basic base R summaries with those values. If we pipe the data to summary, we get the min, the max, the median, the first quartile, and the third quartile. Between those two, we can figure out the interquartile range. Another exploratory data analysis library that I find really helpful is skimr, and we can run its skim function over that same data frame. Remember, this data frame only has four variables and they're all numeric. skimr will give you information on more than just numeric variables, but there are only numeric ones here. It gives you a mean, a standard deviation, and the quartile break points. And one of the things I really like is that it gives me a little spark-graph histogram at the end of each variable. So I can see the distributions. I can see that score is right skewed... I'm sorry, left skewed, heavily weighted towards the upper end of the score, right? You can see all that there. It's a quick way to visualize the distribution. I'm going to skip over correlation for now, unless you bring me back to it, but you can get correlation scores on all of these things. I wanted to talk about regressions in general. So here's how it works with broom, the modern tidyverse way. I'm fitting a similar model to the one I showed you before, where I'm going to try to predict score from beauty average, bty_avg. This is the response variable, and this is the single predictor variable. The data frame is identified right there as evals_ch5. And I'm going to put that into an object called score_model. Okay, I don't see any output because I'm not calling it. So let me redundantly put the name there and run that again. And that's very similar to what we did at the beginning.
But with broom, library broom, we get this function called tidy, so tidy of score_model. And in moderndive there's get_regression_table. These are roughly synonymous. You don't need both. Again, if you're more comfortable with regression models, probably just use the broom functions. If you're not, maybe you want to start with these teaching functions, but this one is based on that one. If I pass my model to those functions, I get a much tidier view of my model. So there's my intercept. How does that work? For every unit increase in beauty average, the score goes up by 0.067. I'm pretty certain I'm reading that properly; I hope I'm not making a big mistake there. I'm also getting my p-values. Again, this second function is for learning, and I know most of you would want more precision than that. So if you use tidy rather than get_regression_table, that's this table, not this table, you get the same basic kind of stuff, and there's your scientific notation on the p-value, so you get more precision. All right, so that's all tidy really does. tidy just tidies up the model into something where you can then use your dplyr verbs to easily pluck out different aspects. For example, I could say select p.value. Let's see if that works. Yeah. And I could say pull p.value and get the values in a vector. Because it's a data frame, I now have all of the other dplyr verbs at my disposal to manipulate it. glance and augment do similar kinds of things. They tidy up the output. In the case of glance, I get my R-squared and my adjusted R-squared, and I've got my p-value again. This is actually not glance. Sorry, this is the moderndive one. So let's go back to the first one, glance.
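The broom workflow described above can be sketched as follows, assuming the moderndive and broom packages are installed. The names evals_ch5 and score_model follow the session as best they can be recovered from the transcript; the column names ID, score, bty_avg, and age are those of moderndive's evals data.

```r
library(dplyr)
library(broom)
library(moderndive)

# Keep the four variables used in the session
evals_ch5 <- evals %>%
  select(ID, score, bty_avg, age)

score_model <- lm(score ~ bty_avg, data = evals_ch5)

# broom: one row per model term, returned as a data frame
tidy(score_model)

# moderndive's teaching-oriented equivalent, with rounded values
get_regression_table(score_model)

# Because tidy() returns a data frame, dplyr verbs work on it
tidy(score_model) %>% select(term, p.value)
tidy(score_model) %>% pull(p.value)

# One-row model summary; single values are easy to pull out
glance(score_model)$r.squared

# One row per observation, including fitted values and residuals
augment(score_model)
```

The column is named p.value and the model summary column is r.squared, with dots rather than underscores, which is the detail the speaker trips over a little later.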
Same kind of information, a little more precision, scrolling to the right, and it's much easier to get hold of each one of those individual values. If I put this in... oops, control-shift, that's not what I wanted to do. Let me save that as my model. All right. And then comment this out for a minute. Then I could say dollar sign, r underscore squared. I should be able to run these two together. Why did that not work? r dot squared. Thank you. That's an easy way to get a single element as a numeric vector. augment, among other things, gives you your residuals. So there are your residuals right there, going all the way down, one for every observation. I'm not sure if you want your residuals. Really, we're getting beyond the kinds of things that I do with programming. I would say this: remember that we have some people who know more about statistical modeling than I do. I'm just trying to point out the easy ways to manage the models. But if you have a question, I'm happy to try to answer it. I'm going to check my time. We've got 35 minutes, which should be enough. Let me look at my notes. Looks like I said I would do interactive visualizations next, then pivots, then joins. I feel like we've got plenty of time to get all this done, so I want to make sure that I give people ample space to ask questions. If I can, I will certainly try to answer them.
I have a question that's going back a little bit. For a scatter plot, is it possible to label the points on the graph?
It absolutely is. So let me see. How do I want to show you that? In a case like that, I'm going to go back to Layers in the documentation. There are things called geom_text and geom_label. Here's geom_label, there's geom_text. I would refer to the documentation on those for how they get implemented. Down at the bottom, you can see that you can turn any point into a label itself.
And the difference between text and label is: that's text, and that's label. With label you get a box around the text, which sometimes makes it easier to see and sometimes less easy to see. You can see that you can affect lots of aspects. Now, these look very busy, so of course you would want to filter what you label depending on what you're doing. You can label not only scatter plots but pretty much everything else. And there's some information in here about how you can annotate the graph even without the data values. I'm pretty certain there's another helper library whose name is not coming to the top of my head. If it's of interest to you, send me an email and I will send it to you. It will help you do even more; for example, it will automatically pick out a few high points so that you don't have to go through the trouble of filtering all your data. But let's see if I can do this. geom_text. Let's go back to our example and get up to the top. All right. Let's do this one. Now if I add a layer, geom underscore text, I'm not certain what's going to happen here. Yeah, it requires something else: geom_text requires the following missing aesthetics: label. Oh, right. That's easy enough. So aes, label equals... well, we'll do class. Actually, let's do something different from class. Let's do model. This is going to be pretty ugly to start with. Oh, and one of the things that helper library will do is repel the label from the data point and draw a line to the data point. It's a really nice feature. I'll have to figure out what that library is. So one of the things I might do is something like this: data equals, and I'm going to say dot, which is tidyverse shorthand for the prevailing data frame. I could also write this out in longhand, mpg. Then filter where class equals, let's do 2seater.
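The labeling steps being assembled here can be sketched as below, assuming the mpg data. This version uses the longhand form the speaker mentions, passing an explicitly filtered copy of mpg to the text layer, because the exact dot-shorthand call typed in the session cannot be recovered from the transcript. The object names are illustrative only.

```r
library(dplyr)
library(ggplot2)

# Label every point with its model name: legible only on small data
p_all_labels <- mpg %>%
  ggplot(aes(displ, hwy)) +
  geom_point() +
  geom_text(aes(label = model))

# Label only the two-seaters by giving the text layer its own data
two_seaters <- mpg %>%
  filter(class == "2seater")

p_some_labels <- mpg %>%
  ggplot(aes(displ, hwy)) +
  geom_point() +
  geom_text(data = two_seaters, aes(label = class), vjust = -0.8)

# geom_label() takes the same arguments and draws a box around the text
p_boxed <- mpg %>%
  ggplot(aes(displ, hwy)) +
  geom_point() +
  geom_label(data = two_seaters, aes(label = class))
```

The text layer inherits the x and y mappings from ggplot(), so only the label aesthetic and the smaller data frame need to be supplied. The helper library the speaker could not recall is not named in the session; the ggrepel package offers the label-repelling behavior described, though whether it is the one meant is a guess.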
Now this should allow me to label only, close parenthesis right there. Data, not date. Sorry. Oh, did I write date? Thank you. Thank you. Data. Thank you so much. And see, you can see, again, I should do a little more manipulation, but it's a really easy way to label just, in fact, maybe I don't want the label to be, in this case, the model. Maybe I do, in fact, want the label to be class, because I want to make the point that 2seaters are weird, right? Or they have a strange, unique characteristic in my scatter plot. So I hope that helped. Yes. Thank you so much. That's very helpful. Good. Yeah, please do send me an email and I can happily find the library that augments that and makes it even better. Any other questions? Sure. Okay, so let's go on. Well, actually, I'm sorry, I was looking over my screen. If someone was about to say something, please feel free to jump right in. Yeah, John, can I ask a question? It's not specifically about visualization, but how would you work in a group, or on research with another person, to share your code? Would you send the whole project? How would you work together? That's a great question. So I'm going to minimize the screen. Actually, let me get back to that and minimize this screen. I'm going to go back to this and say that you could watch this video on projects. Projects are definitely a big part of it. That's the RStudio way to make it really easy to share your RStudio project with somebody else. And when I say with somebody else, let me say that when I think about sharing, the person that I share with most really is me, right? Because I have one computer at work and a different computer at home, and they don't have the exact same idiosyncratic file system. And I don't want to have to edit the script every time I am working from home.
If I just have the same project, it's going to work every time. And then in terms of synchronizing the changes, that's where version control, Git, and GitHub come in. So I'm going to go back up here, click on our workshops, and show you that I'm actually doing a workshop in about two weeks. This page is a little out of date; I'm going to update it in the next two weeks and do the workshop. Let me see when. Find Git. Yeah, this one, February 17th, where we're going to talk specifically about how you can do this with R. But what I would do is use a library called usethis. Let me put that into the chat so it's really clear what I'm saying. library usethis. And what usethis allows you to do is use R as an interface to GitHub. You have to have Git installed if you're on a Windows machine. I think Mac and Linux users may already have that installed by default. And what that enables me to do is then push my project up to GitHub. And I can control the permissions. I usually make my things widely available, but you can make them private. And let's look at a different Git repository, so we can see some things that are being demonstrated here. Actually not seeing the thing that I want to see. Not sure why. Maybe it's because I'm not logged in. Anyway, oh, it's right there. Sorry. It was there. I didn't see it. What this tells me is that I have made 95 changes to this repository. And so you can actually time travel and go back to any previous version that you want. But it's always presenting the most recent. And you can do something called branching. This has only one branch. But if I had my project exactly how I wanted it, and I wanted to make an alteration, but I didn't want to change the pristine, almost-ready-to-publish version, I could branch it and make all kinds of changes on the branch. And if I like the branch, I can merge it back in.
If I don't like the branch, I can just leave it or delete it. And it also allows me to collaborate with others. People I know, or for that matter, people I don't know. They can do something called a fork. They can get my same code, fix something that I maybe did wrong, and send it back to me as what's called a pull request, because they think that it can make my code better. Again, that's all optional. All that social coding is optional. But for me, the easiest way to share, I shouldn't say the easiest. The most full-fledged way to share projects is using Git, GitHub, and version control together through RStudio. It does take a little bit of setup, and I would say it delivers on that effort. But if you don't want to put that effort into it, you can actually just share, let me minimize my screens. Remember when I started out here, I expanded this project as a folder. So if I go up one directory to Desktop, there's that folder right there. And because it was started as an R project, it's got this R project file. So I can zip it or not zip it. I could send it to somebody through email. I could put it on a disk and send it to them. I could put it in Dropbox or Box or OneDrive. I would caution you against keeping your projects in OneDrive, because I actually think that OneDrive and Dropbox get a little wonky when you start trying to do version control inside of a file synchronization system, which is one reason to use Git and GitHub instead of those. But the dead simplest thing to do is to take this RStudio project and just share the whole folder; just send it to somebody as an attachment. And they'll be able to do work on it, or you can send it to yourself over email and go home and pull it out of your email. You'll be able to do work on your own project. It'll work every time. Great question. Thanks for asking. Okay. So on to interactivity.
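For the Git and GitHub route described above, a minimal sketch of the usethis calls might look like the following. The wrapper function name is hypothetical and exists only so that nothing runs until you call it; the sketch also assumes Git is installed, that an RStudio project is open, and that GitHub credentials are already configured, none of which the talk walks through.

```r
library(usethis)

# Hypothetical helper: wraps the one-time setup so nothing runs on load.
# Assumes Git is installed and GitHub credentials are already configured.
share_project_on_github <- function(private = TRUE) {
  use_git()                      # put the active project under Git version control
  use_github(private = private)  # create a GitHub repository for it and push
}
```

Calling share_project_on_github() from inside the project would then set things up in one step; passing private = FALSE makes the repository public instead.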
Oh, this is cool. This is actually fun. It's fun because it's mind-boggling how simple it is. Let's go back to the visualization script. Oh, not that one. Am I on the right spot? I wonder what you guys are seeing and what I'm seeing. Yeah. So I want to be in my, no, this is right. This is right. Let me just get down here, and now it's not. Sorry. Let me just open this back up and demonstrate again. So I want to be in the exercise area. I'm going to open up the exercise project. And I want to open O2Viz. Let me do O2Viz answers. And let me scroll down to the bottom just to make sure I have what I want. And what I'm going to do is Run All, so that I'm sure that everything runs the way I want. And you can see a progress bar down here; in case you don't know, if you click on that progress bar, it'll take you to the exact spot where it's executing, which is kind of cool. So here, let's make something visual. I'm going to start this way. Here's a ggplot plot that I want to make. It is a stacked bar plot of the midwest data frame that we were looking at earlier. And you can do this with any ggplot object. It requires that you load this library called plotly. If you haven't installed the library, we didn't talk about this much, but, library also known as package, you can always click this Packages tab, click Install, and type plotly. I'm not going to do it, and if you click Install, it might take some time, so maybe you don't want to do it at the moment either. But once I install it, I only have to install it once; I do have to load it every time I'm going to use it. And so I loaded it. And now I can make this thing interactive. So this is a standard ggplot from this code right here. Nothing special, standard ggplot output.
If I assign that ggplot output to an object name, when I run these three lines of code, of course I don't get any visual output, but I now have that object in my environment. So if I want to look at that plot, I can call the object name that I just assigned. And then if I want to make it interactive, I just put that object name inside of the function ggplotly, right, part of the plotly library. So when I run line 132, it changes the plot a little bit, but it gives me a whole bunch of tools that I can turn on or off. I can put this into a web page and it's instantly interactive. I do a lot with it in terms of reports. This is just a standard bar plot, so it doesn't have everything I would want right off the bat, but the simplicity of doing this is amazing. And then I can alter it. Some of the features are that I now have flyout windows for each one of these variables, and I can control what shows up in those flyouts. Again, this is a toolbar that I can turn on and off, but I could get flyouts for the whole bar. Maybe I want to drill down and look more closely at some of these bars that are really similar. So I can subset the plot and look more closely at what's going on. I think if I click the home button, it'll reset everything. Yeah. Now, this particular feature is not all that helpful in a bar plot, but with other kinds of plots it can be helpful: I can instantly turn some parts of the graph off and selectively display other parts. Again, it's not all that helpful in a bar plot, but any ggplot visualization that you make, I think there are maybe minor exceptions, if you put that plot object inside of the function ggplotly, it'll instantly become interactive. That means you can shove it into a dashboard without learning Shiny.
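The assign-then-wrap pattern just described can be sketched as below. The aesthetics are an assumption: the talk says only that the plot is a stacked bar plot of the midwest data frame, so the choice of state on the x axis and inmetro as the fill is illustrative, as are the object names.

```r
library(ggplot2)
library(plotly)

# A standard stacked bar plot, assigned to a name instead of printed.
# Assumed mapping: counties per state, stacked by the metro-area flag.
midwest_bars <- ggplot(midwest, aes(x = state, fill = factor(inmetro))) +
  geom_bar() +
  labs(fill = "In metro area")

# Wrapping the ggplot object in ggplotly() returns an interactive widget.
interactive_bars <- ggplotly(midwest_bars)

interactive_bars
```

Printing midwest_bars gives the static plot; printing interactive_bars gives the same chart with hover flyouts, zoom, and a clickable legend.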
There's a library called flexdashboard that works with R and allows you to make interactive dashboards without going through all of the hassle of learning Shiny. Now, I'm not saying don't learn Shiny, but that's a whole other thing to learn. If you're starting out, maybe you don't want to start there. And if you don't know what Shiny is, it's just another dashboard thing. So that's actually the long and the short of it. There's way more that you can do with interactive plots, but that's the part that I wanted you to know. And then you can look at the ggplotly documentation to learn more about how you manipulate that. I want to actually grab this, let's see if I can find it. John, while you're looking, do you think that would work on a virtual poster? Oh, yeah. Yeah. So what I would do, by the way, just real quickly, I'm going to put the post-workshop survey into the chat, but you should also get an email in about an hour. I just want to make sure I put that in there in case anybody is so inclined. So, I know how I can demonstrate this for you for a virtual poster. If I go to flexdashboard, it's going to take me here. And there are some galleries of examples. Some of them are Shiny, but not all of them. Here's an example of a ggplot2 dashboard that is not Shiny; it's not using the Shiny code base. And you can see that they're all interactive plotly graphs. And there are two different pieces. But yeah, you could turn that into an interactive dashboard with the knowledge that you already have. It's just going to extend your wizardry that much farther without having to learn a whole new thing. Thank you. Yeah, I'm actually a big fan of flexdashboard. I like what you can do with them. And you can learn it all here.
The nice thing about flexdashboard is that it is extensible, right? If you learn it, and then you get even more into dashboards, you can bring Shiny interactivity into your dashboard, but you don't have to. You can control layouts, making different kinds of dashboards, like one column on the left and two on the right, or four, or one on the left with one on the right that has tabs. It's pretty straightforward. It's well documented. And so yeah, that's interactivity. And then I was going to cover pivots and joins, and pivots got the next most. So let's see what we can learn about pivots in the time that we have left, which is, I'd say we've got about 15 to 20 minutes. Again, especially because we're winding down, if there's something you want to know that I haven't answered, please feel free to just shout it out. Pivots. Okay, I'm going to go back to this one and go to O2B pivot join, and make that full screen again. The whole thing about pivoting is that you're changing your data either from wide to long or long to wide. And the use case that we were talking about earlier, one reason to do that is because long data works really well with iterating and with ggplot. Long data works really well with a lot of things, but it's not the only way to organize your data, and you'll definitely run across applications or packages that want wide data. So you can apply this judiciously. I'll make sure that my tidyverse library is loaded; that will include the package called tidyr, which is the package that has the pivot functions. Now, I say pivot because a lot of people have heard that, but the name has changed. It depends on what application you're using, and the name has changed even within the tidyverse, where you may have heard of spread and gather. Until recently, they had spread and gather.
Recently they've gone to pivot_longer and pivot_wider. pivot_longer does what gather did and pivot_wider does what spread did. They're not 100% identical, but they basically do the same thing. And I find pivot_longer and pivot_wider easier to use. So we're going to start with importing some data that I have in this repository. Right here, there's some data called favorability, 538 favorability popularity. 538.com, if you haven't heard of it, is a survey website. Some statistical wizards live over there. And one of the surveys they did asked people which Star Wars characters they like best, and they have a lot of their data sets up on GitHub and you can use them. And so I'm just using a standard tidyverse approach, skipping the first 11 lines because there's some provenance information in there about where the data came from, and reading that in and ending up with a very simple data frame of 14 rows, 10 of which you see here: characters rated, where a higher rating means more favorable. Not surprisingly, Jar Jar Binks gets a really low rating. And Emperor Palpatine gets the lowest rating, which I think he deserves. So, oh, wait a minute, that was all about joins. Let's go to pivots. Sorry, got off track there. Back to economics and economics_long. The point of this was to demonstrate that this is the wide version of the economics data, and this is the long version of the economics data. So if I run both of these together, we talked about this, the wide version is six columns wide and 574 rows. And the long version is 2,870 rows, only four columns wide. And these days, because disk space is a lot cheaper than it used to be, having the kinds of redundancies in the long data is not generally a problem, and it leads towards some efficiencies later. So that's one of the reasons to pivot longer. Now, ggplot2 is giving us both of these data sets, so we don't have to run the pivot command.
But if we wanted to take this data and make it look like this, we would use the pivot_longer function. And it takes a couple of arguments. Let's look back here at the wide data. The first argument is what part you want to pivot. What I'm saying is I want to pivot everything from the pce variable to the unemploy variable. That's just a dplyr kind of select statement. But I don't have to use the colon, right? I think I could name them all individually; actually, I shouldn't say I think, because I don't want to jinx what I've got going here. But you can always read the documentation. You can select several columns. Or you could subset the data so that you just get what you want. So that's what I'm doing: I'm saying this is my range, from pce to unemploy. And then I'm going to take these variable names and turn those into a column. And I want the column to have a name. I decided to give the column the name variable, which makes it hard to discuss. But this column named variable has these categorical values. So I'm taking this top row and making it a column called variable, and then taking all the values and putting them in a column called value. And that's all that pivot_longer does. pivot_wider does kind of the opposite of what pivot_longer does. This is something that almost certainly will show up as you use R more. It can get a little boring to read through, but it's really important to know that it's possible to do. And if you look at the pivot_longer and pivot_wider documentation, they'll give you some nice examples that you can follow on how it all works. So I did not plan to get into that in any more detail, but I'm happy to try and field a question. In my experience, some people find this confusing, and that makes sense. So if I can demystify it now while we're all together, I can certainly try; please feel free to open your mic and ask a question.
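The pivot just described can be sketched as follows, using the economics data set that ships with ggplot2. The object names econ_long and econ_wide are illustrative. Note that this sketch yields three columns, not the four in the built-in economics_long, because that data set carries one extra rescaled column.

```r
library(ggplot2)  # provides the economics data set
library(tidyr)
library(dplyr)

# Wide to long: every column from pce through unemploy becomes a
# name/value pair, leaving date as the identifying column.
econ_long <- economics %>%
  pivot_longer(
    cols = pce:unemploy,
    names_to = "variable",  # former column names go here
    values_to = "value"     # former cell values go here
  )

# Long back to wide: the reverse operation.
econ_wide <- econ_long %>%
  pivot_wider(names_from = variable, values_from = value)

dim(econ_long)
dim(econ_wide)
```

Five pivoted columns times 574 rows gives the 2,870 rows of the long form, and pivoting wider recovers the original 574 by 6 shape.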
And I see from the clock that we're getting really, really close to the end. So I'm going to go back up here to the last bit, which was joining data, right? So I have this favorability ratings table. And what I want to do is join that with the onboard starwars table, which, if you haven't seen it, is part of dplyr. Oops, not starts_with; starwars. And that's an 87-row table of characters in the Star Wars films. And it's a bunch of information: there's sex, gender, birth year, skin color, hair color. What I want to do is bring this favorability rating column and make it part of a wider table, or a bigger table. So I'm going to need to do some kind of join on some common value between the two data frames. And fortunately, these two tables have a column with the same name. So I'm going to use the variable called name as what's called my join key. So anywhere I find Luke Skywalker in one table, if there's a match in the other table, I'll bring over all the other data. Same for Han Solo. If Han Solo matches over here, I'll bring over all the other data. That's what's called a left join. There's a graphic of the different kinds of joins you can do in dplyr. I will throw that in the chat, and you can look at it at your leisure. But we're going to do what's called a left join, which is to say, if there's a match in the right table, bring over what's in the right table. And that's done this way. I'm going to start with the left-hand data frame, which holds the favorability ratings. And I'm going to use the left_join function to join it with the right-hand data frame. Now, I have a commented version up above, which is the longhand version. left_join is smart enough to know that if there's a common column name, you don't have to spell it out. But just so you can see it, if you had one column that was called name and one that was called full name, you could explicitly identify the join key.
But that's commented out. And that's all there is to it. You just do a left join, and then I'm going to arrange that. And what I get now is my favorability rating alongside all of the starwars data where there was a match. Where there's not a match, I still have my favorability rating, but I'm also limited to a 14-row data set, which is what the favorability ratings table was. If I did that the other way, you would find that it looks a little bit different. Oh, I want to do this: starwars left join favorability ratings. If I run that second one, this one hasn't gotten any bigger. It's 87 rows. And what I've added is the favorability rating, which is going to be all the way over to the right. And you also see a lot of NAs there, because not everything is matching. And not everything is matching because this example uses the worst possible join key. Ideally for a join key, you want some kind of alphanumeric unique identifier, right? This is easy to think about if you imagine a bureaucratic HR department where I am the HR director for Star Wars characters. I'm going to give every one of those characters an ID that's unique to them. And then if I had those unique IDs, the join would be very precise. If you have to use alphabetic keys, sometimes it works. For example, state codes in the U.S. are very precise: NC for North Carolina. Although there can still be a lot of variation, right? You could have NC, or you could have NCAR, or you could have lowercase nc. So anytime there's room for variation in your join key, there's room for error or mismatch. That's why having a precise ID is great. The absolute worst thing to join on probably is name, because names can have so much variation. And in fact, you see that here. I ran some anti-join commands so you can see it.
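The two left joins and the anti-join check can be sketched with a small stand-in table. Everything about fav_ratings here is hypothetical: the real table came from the survey file, which is not reproduced in this transcript, so the four rows and their rating values are placeholders chosen only to show two matches and two mismatches against dplyr's starwars data.

```r
library(dplyr)

# Hypothetical stand-in for the survey table; ratings are placeholders,
# not the survey results.
fav_ratings <- tibble(
  name = c("Luke Skywalker", "Han Solo", "C-3P0", "Emperor Palpatine"),
  fav_rating = c(4, 3, 2, 1)
)

# Ratings on the left: keeps every ratings row and adds starwars columns.
fav_with_details <- left_join(fav_ratings, starwars, by = "name")

# starwars on the left: keeps all 87 characters, NA where no rating matched.
# If the key columns had different names, by = c("name" = "full_name")
# would spell the pairing out (full_name is a hypothetical column).
starwars_with_ratings <- left_join(starwars, fav_ratings, by = "name")

# Rows of the left table that found no match in the right table.
unmatched <- anti_join(fav_ratings, starwars, by = "name")

unmatched$name
```

The unmatched rows are the ones whose spelling differs between the two tables, which is the data cleanup cost of joining on names.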
You don't need to really worry about these commands, but what's not matching? In the first table, it's C-3P0, with a zero. And in the second table, it's C-3PO, with the letter O, so no match. In one of the tables, it's Emperor Palpatine versus just Palpatine. And that all just goes to the fact that computers are exceedingly literal and names are really very flexible, and that's why name is not a good join key. It just means you have to do a lot more data cleanup if you want to use names. There are techniques. I mean, you could come up with them yourself, like make everything lowercase, take out the spaces, take out the diacritic characters. There are all kinds of things you can do, but again, if you had a code to match on, you'd probably be better off to start with. So that's everything. I think I actually went two minutes over, so apologies for that. Thanks for staying so long. I can stay a little while longer, but at least I want to say thank you for your attention, and feel free to reach out to me if you want to do a data consult. I hope it was helpful. Just a quick question. The anti-join, what do you use that for? Okay, great question. I know I went through that really fast. Left join is what most people end up doing. A left join is where you take everything from the left table, and if it matches something in the right table, you get that too. The anti-join is saying I only want to know the stuff that's in the left table and has no match in the right table. And so I was using those against each other to figure out what didn't match. All right. Special thanks to Drew Keener for hanging around for so long. Thanks, Drew. Yeah, of course. Well, you guys are free to go anytime you want, and you're also free to unmute and ask more questions either way. If you don't ask a question, I hope to catch you again at another workshop. Thank you. Thank you, Joe. Thank you. Yeah, that was great. Thanks, Drew. Appreciate your help.