Good morning and welcome. My name is John Little and I'm the data science librarian in the Center for Data and Visualization Sciences, which is a department embedded in the Duke University Libraries. This is part two of a two-part workshop. You may have registered only for part two; somewhere in the registration we asked that you be familiar with the material we covered yesterday. So we may move a little quicker today, but everything I've done is recorded, so if we cover something that is unclear to you, I'll give you some tips on how you can get up to speed. What we're going to cover today is visualization with ggplot2, which is a library that is part of the tidyverse packages in R. We will cover interactive visualizations as well: mostly static visualizations, but there's a quick step for turning those into interactive visualizations. We'll cover pivoting your data, making it wide or tall. We'll cover joining data using a function called left_join. And we will cover a very basic introduction to doing linear regressions in R. Just to remind you about preparation work for this semester: all the workshops are recorded, so if you go to Online Learning... of course, for some reason the web server is slow at the moment, but from this Online Learning page you will get links to recordings of previous whole workshops covering things like spatial analysis, reproducibility, Git, and how to do good visualizations and charts. You can see that, for example, under Data Science you can expand R and tidyverse, and there's a whole bunch of stuff there, but there are things in every category. What we share there is slides, if they're available; data, if we can share it; and video recordings of previous workshops, so you can always go back there and look at these. In addition to that, I host a site.
It's what I call the Rfun site, which stands for "R We Having Fun Yet," which is a sort of predilection that the R community has with the name of their programming language, R: they like to make fun of it and use it in innovative ways. This site includes little modules for all kinds of different workshops, different things that you can do with R. You're welcome to use that; it also includes video, code, and slides. There's a section here on ggplot visualization, but we're mostly working off of this first module, Quick Start with R. People who were here yesterday would have already looked at these two embedded videos. Today we'll cover this one, and this one is really just a repeat: the last couple of minutes, from minute 20 to about minute 25, are about regression. And then there's the section right here. If I say things about joins and you want to brush up on them, or it's been six months and you want a reminder, you can go watch this brief video. There's more information about assignment and pipes, how to use the RStudio IDE, how to install packages, or what R Markdown is, which we talked about yesterday and will talk about again today. There's more information on all of those, plus this link to a playlist that includes all of these videos. At the bottom there are two links to full recordings of the part one and part two workshops, not necessarily from yesterday and today, but full recordings nonetheless. Let's see, I think I covered that. Oh, if you'll give me your attention, I like to start out with a land acknowledgement. I would like to take a moment to honor the land in Durham, North Carolina. Duke University sits on the ancestral lands of the Shakori, the Eno, and the Catawba people. This institution of higher education is built on land stolen from those peoples. These tribes were here before the colonizers arrived.
Additionally, this land has borne witness to over 400 years of the enslavement, torture, and systematic mistreatment of African people and their descendants. Recognizing this history is an honest attempt to break out beyond persistent patterns of colonization and to rewrite the erasure of Indigenous and Black peoples. There is value in acknowledging the history of our occupied spaces and places, and I hope we can glimpse an understanding of these histories by recognizing the origins of our collective journeys. So, thank you for listening to that. For those of you who are just logging in, that was a land acknowledgement. We're really not going to talk about social justice issues in today's workshop; we're just going to talk about R and how to visualize data, how to transform data, and how to do some quick regressions. All I ask is that you think about injustice, and if there's something you can use from these workshops that will help you fix an injustice, then I couldn't ask for anything more. Moving on, this is who's in the class today, by graduate or academic status and from various disciplines. It looks very similar to what we had yesterday. This chart was done in R, obviously, by pulling in some data from the survey form, which was done on Google Drive. Google Forms is what it's called, and you can flip a switch in Google Forms that will generate a spreadsheet of data. Then you can use R to ingest that data and write some code to visualize it or analyze it in whatever way you want. We're going to talk about how you generate a graph like this today. One of the things that's kind of nice about ggplot is that it is a whole grammar for how you can describe graphics. We won't start out with the full complexity of the grammar, but notice that this plot and this plot are the same.
The only difference really is one line of code right here, which totally changes the look of the entire plot: it puts it into facets. That's one of the things about ggplot: with this grammar you can start to describe layers of graphics. If you can describe one kind of plot, for example a bar plot or a scatter plot, then you can use that to describe and generate other layers and different kinds of plots, such as regression lines, violin plots, or box plots. It all starts to work together, and that's a really nice feature. Let me switch to the web browser over here and open up the first GitHub repository; the other GitHub repository is right there. We downloaded these GitHub repositories yesterday; if you downloaded them yesterday, you do not need to do this again. If you're new to this workshop, I'm going to do this twice. The steps are important. You can use RStudio to negotiate all of this, and it's really convenient, but the initial configuration for doing that is a little bit beyond the scope of today. So instead, for downloading these repositories, which are RStudio project folders, we'll just use the always-works method. Opening up the Rfun flipped repository, I'm going to click on this green button where it says Code. When I do that, there is a context menu, and I'm going to click Download ZIP. Essentially, what we're going to do after it's fully downloaded is... anyway, we will expand these repositories in just a moment, which is important. So I'm going to flip over to the other repository and do the same thing: click on Code and then click Download ZIP. These will download to wherever your system puts downloaded files. Most people generally know where that is. In my case I'm downloading them to the Downloads folder, and I can just click on this context menu and click Show in folder.
That will bring up my file manager, which in Windows is called the Windows Explorer. You'll notice I've done this a few times. Now, this next step is also important: it's important to expand the zipped, compressed folder, because we may save things back into that folder. I'm not 100% certain how you do that in the Mac world, but I've never had any Mac people ask me about this, so I'm assuming it's pretty straightforward, like you just double-click on it. For Windows people, you can right-click on that file and choose Extract All. You can put it anywhere you want. Let's go ahead and put this on the desktop: Select Folder, and Extract. If you have WinZip or 7-Zip or something like that, it may work a little bit differently, but in any case, right-click and Extract All. I'm going to put that on my desktop and maybe give it a different name. Well, it's going to do the same thing, but it'll be fine. I'm going to click "Skip these files," and hopefully that won't cause us any trouble. Now what I have on my desktop is those two folders. These folders are R project folders. You would have gotten an email that directed you to this particular GitHub repository; you may not have downloaded that data. But once you download it, in order to open it up as an RStudio project, you can find a file called workshop_rfun_flipped, which is an R project file. We're not going to start there, however; we're going to start over here in the exercises, which is not something you've seen before. This is a slightly different view of the files, but you're just looking for an icon that looks like this: it looks like an R in a box. Double-click on that. It will load this RStudio project into its own discrete instance of RStudio.
I think yesterday I may have increased the font size in my RStudio; I just want to double-check that. You don't have to do this part. It's at 150%; that should be fine. Hopefully you can all see this well. Yesterday we covered 01a, dplyr. dplyr is a package that is part of the tidyverse, similar in some ways to ggplot2, which is also part of the tidyverse. dplyr is used primarily for manipulating and wrangling data. Today we're going to use ggplot2, which is primarily for generating plots with a grammar of graphics. We'll also use tidyr, which is useful for tidying data, but mostly for pivoting. I think pivot is in tidyr and left_join is in dplyr. Anyway, it doesn't really matter, because we'll just load the one library, tidyverse, and when you load tidyverse it actually loads eight related tidyverse packages at once. Tidyverse is the conceptual name for what they refer to as an opinionated set of packages that all work well together. They use the word "opinionated" intentionally, because what the people who develop the tidyverse are trying to say is: we think this is the best way to use R. They've certainly put a lot of effort into making it consistent across packages, not only in how the functions work but also in the documentation, which is all online. But they also recognize it's not the only way to use R. There are lots of people who use base R, and you can use any one of these tidyverse packages with or without RStudio, and with or without the rest of the tidyverse. So if you're a base R person, you can still use ggplot2 without using the rest of the tidyverse, if that's what you wish to do. We're going to start today, however, in this 01b viz and EDA file; EDA is for exploratory data analysis. Click on that one file right there. That should bring up a file in my editor, and you probably have something similar.
I'm going to drag this up to the top so that we're looking at that, and I'm going to make this just one pane, which hopefully will be easier for you to see. Just as a quick review: the top seven lines are what's called the YAML header of an R Markdown script. Since it's an R Markdown script, that means for the most part I can intersperse code with prose. There's some formatting code here that is actually referred to as Markdown. If I click on this little compass icon in the upper right-hand corner of my editor, I can shift to a visual editor, which some of you may like better. It may download a package in order for that to happen. But I'm not going to. So you can see the structure of the R Markdown: some things are bold, some things are links, there's a bulleted list, and there are second-level headers. Interspersed amongst all of that formatting are different code chunks, these little gray boxes, each of which you can run separately just by clicking on a little green button. I'm going to switch back to the text view because it's more comfortable for me. I will point out two things. If you go under Help and click Markdown Quick Reference, you can get some information about how to use these typesetting codes. And if you wanted to create a new R Markdown document, you could just click on this little plus right here. Which one of these you choose depends on your needs; you're creating a new file, a script file. For my part, I almost always start with an R Notebook, and then I may choose to change the output to something else. The output is designated right here, and you can make that output a PDF file, a web page, a dashboard, an ebook; the list goes on. We won't cover all of that today. There are modules in the Rfun site about some of those things, and I'm happy to answer questions about them. And we will cover today...
Oh, one more thing. Let me expand back out to the four-quadrant view. There are three libraries that we're going to load. One of them is gapminder, which you may or may not have, and you may or may not have skimr. If you opened this file in the editor and you don't have those libraries, there's a little yellow bar right across the top that will say you don't have these libraries and ask whether you want to install them. That's the most convenient way to install them. But if you don't see that, you can always click on Packages and then click on Install. For example, you can type gapminder and go ahead and install that. I'm going to click Cancel because mine is already installed, and switch back to the full view. When I execute this code chunk, it's going to load these three libraries. Remember that in R you only have to install a package once, but you have to load a library for every script that you're running. I get some feedback from the execution of these three lines. The main part here is that this is telling me that when I load the tidyverse, I'm really loading eight libraries, one of which is ggplot2 and another of which is dplyr; those are the two main ones that most people use and are aware of. There's also readr, which is for ingesting files, and forcats, which is for making your character vector data categorical, or for that matter it can make your numeric data categorical. It's actually pretty handy in ggplot. That's probably all we really need to cover there. But when you load tidyverse, you automatically get some onboard datasets. This is wrong, I just realized. One of those onboard datasets is called starwars. That is just data about Star Wars characters. I'm not sure if it covers all nine movies or just the originals. I don't even know how many Star Wars movies there are anymore, but I don't think it's totally up to date.
But it just gives you some information to play with that you might be familiar with; a lot of people are familiar with Star Wars. An easy thing to do is to use the glimpse command, just to get a sense of this data structure. This tells me I've got 87 rows and 14 columns. These are the names of the columns right here, with a preview of the data for each column. The other thing you'll see here is a little bit of information about the data type. So I have character data, integer data, and floating-point (double) numeric data. I have some lists down here, which for the purposes of yesterday and today we're largely going to ignore. This whole thing, starwars, is actually a data frame, so if I just type starwars by itself, I can see a grid of rectangular data. It's a preview, just a different way of looking at the data. What I want to do is plot the height and mass of my Star Wars characters in a scatter plot. Scatter plots are a great plot to start with, particularly if you have numeric data, because they give you an idea of how the data is going to look in a two-dimensional representation, and then you can try to alter that ggplot code to create different kinds of plots. But let's first review the basic structure of ggplot syntax. If I write this out a little bit more, I think this is mapping equals; I'm pretty certain that's the way that goes. The basic structure of the syntax is that in ggplot you give it a data frame, sometimes referred to as a table, and then you map aesthetics. We're going to map the x and the y coordinates. In this case, we're going to map height to x and map the response variable to y. So mass is going to go to y, and height is the explanatory variable that goes to x.
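As a rough sketch of the fully written-out form described above, the following loads only ggplot2 and dplyr (rather than the whole tidyverse, which is a simplification made here) and spells out both the `data` and `mapping` arguments. The object name `p_base` is made up for illustration.

```r
library(ggplot2)
library(dplyr)    # provides the starwars data frame and glimpse()

# Row and column counts, column names, types, and a preview of each column
glimpse(starwars)

# Formal construction: name the data frame, then map aesthetics.
# height is the explanatory variable (x); mass is the response (y).
p_base <- ggplot(data = starwars, mapping = aes(x = height, y = mass))

# Printing this draws only the empty canvas; no layer has been added yet
p_base
```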
Then you have a conjunction. Yesterday, you'll remember, we were covering pipes, and the pipe is the standard conjunction in the tidyverse, except when you're working with ggplot, where it uses the plus as a conjunction. This has to do with the legacy and evolution of the tidyverse: ggplot2 was the first tidyverse package, and back then they used the plus as a conjunction. It's a little bit ambiguous, particularly in a statistical setting, because a plus can mean addition. Whereas the pipe, while maybe not so convenient to type, is unambiguous, because there's no other common compound of characters like that, so everybody who looks at it knows it's a pipe, or conjunction. The plus is also a conjunction, and it only works in ggplot. You can't use them interchangeably, but you'll get an error message that will help you figure out that you need to use one or the other. In any case, that's my conjunction. This first part here is just setting up the base of the plot: in your data frame, what are the x and y variables that I'm mapping, which is called mapping the aesthetics. You can also set aesthetics, which is a form of mapping, but it's not to a vector; aesthetic mapping is to vectors, or variables, in your data frame. Then, with the conjunction, you choose multiple layers after that, and they're all called geom_ something. So I might have one geom_ function there, I might have geom_smooth here, and I might have geom_line here. We'll talk about how you figure that out, but you can then generate a multi-layered graphic or visualization based on that kind of grammar. That's the formal construction, but in practice people don't really write it out that way. What they generally do, particularly in the tidyverse context, is name the data frame first, then use the data frame conjunction, the pipe, and send that to ggplot.
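The two conjunctions can be seen side by side in a short sketch. It assumes the magrittr pipe `%>%` (re-exported by dplyr) is the pipe meant here; the choice of `geom_smooth(method = "lm")` as a second layer and the name `layered` are illustrative only.

```r
library(dplyr)
library(ggplot2)

# The pipe hands the data frame to ggplot(); from there on,
# layers are joined with + rather than with the pipe.
layered <- starwars %>%
  ggplot(mapping = aes(x = height, y = mass)) +
  geom_point() +                 # first layer: scatter plot
  geom_smooth(method = "lm")     # second layer: fitted line (illustrative)

layered
```

Both layers inherit the x and y mapping from the `ggplot()` call, which is what makes it possible to stack them.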
You'll notice here that in line 46 I'm using more dplyr functions. I can further manipulate or wrangle the data frame that I have. So I have this starwars data frame; if I highlight that and press Ctrl+Enter, it's an 87-row, 14-variable data frame. If I run just lines 45 and 46 together, I'm subsetting by only looking at the characters that have a mass of 500 kilograms or less. Let's see. Well, that eliminated a whole bunch: I went from 87 to 58. I think that also eliminated all the NAs; I could be wrong about that. Let's just see. I could also do drop_na on mass right here, and I'm just curious what happens when I do that. Yeah, that's 59. So the way filter is working in this case is that it's filtering out the NAs as well as the one character that has a mass of over 500 kilograms. That gives me a much smaller data frame. I wouldn't need to do this, but I can use more dplyr functions. Sometimes I do this when I'm developing a plot: I narrow the data down to the critical parts that I really want to plot, which in this case is height and mass. So I could limit it down to a data frame of just those two variables, and that might help me think through my plot. I might add gender if I wanted to use it for a color or something. In any case, line 48 is not absolutely necessary; it's just a way of noting that you can use all of those data wrangling tools that you learned yesterday to set yourself up to then make a plot. So in practice, people take a data frame and pipe it to ggplot. I did use Ctrl+Enter when I was selecting lines; if I want to run just those highlighted lines, I use Ctrl+Enter. Yes, that's a great question.
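The filtering comparison walked through above can be reproduced with a small sketch. It assumes the filter in the workshop script is `mass < 500` (the script itself is not shown in the transcript); the object names are made up here.

```r
library(dplyr)
library(tidyr)    # for drop_na()

# Keep only characters lighter than 500 kg; rows where mass is NA
# are dropped too, because the comparison NA < 500 is not TRUE.
light <- starwars %>%
  filter(mass < 500)

# Drop only the rows where mass is missing, for comparison
with_mass <- starwars %>%
  drop_na(mass)

# Optional narrowing to just the two variables being plotted
slim <- light %>%
  select(height, mass)

# Compare row counts: all rows, filtered rows, rows with a known mass
c(all = nrow(starwars), filtered = nrow(light), with_mass = nrow(with_mass))
```

The one-row difference between the last two counts is the single character over 500 kilograms, which matches the 58 versus 59 seen live.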
And you say yours doesn't work. Hmm. I don't think I can troubleshoot that right now. You might try restarting RStudio. If you're on a Mac, you might try Cmd+Enter. When we get to a stopping point, maybe we can deal with that. But also, just for convenience, you can put three backticks right there and temporarily make your code chunk smaller. On my keyboard the backtick is above the Tab key and to the left of the number one, but it's probably in a different place on different keyboards. So I'm piping this to ggplot. Again, I'm not writing out the full construction: I'm not writing data equals, because that got piped in, and I'm not writing mapping equals before the aesthetics, because it's just assumed. You'll notice I'm also not writing x equals or y equals, but you could, if it helps you think through what's happening here. I just want you to be aware that this is how people conventionally write it. If we run just that much of the code, all I've done so far is send a data frame with height and mass as mapped variables to ggplot. What ggplot did for me is start to construct the canvas upon which I can make my visualization. It gave me an x label, which is from that mapping, and it gave me a y label, which is from that mapping. I can alter those. It also looked at the range of those two vectors and decided what the tick marks, breaks, and break labels would be for this graph. But of course I still don't have anything plotted. No one would stop here; I just need one more layer. So I use that next conjunction, the ggplot conjunction, the plus, and I say, let's use a scatter plot, which is called geom_point. It's one of the many geom functions that we have available to us.
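A minimal sketch of the conventional short form, separating the empty canvas from the finished scatter plot. The `mass < 500` filter and the names `canvas` and `scatter` are assumptions made for this example.

```r
library(dplyr)
library(ggplot2)

# Short form: data arrives through the pipe, and aes() takes
# x and y by position, so neither "mapping =" nor "x =" is written.
canvas <- starwars %>%
  filter(mass < 500) %>%
  ggplot(aes(height, mass))

canvas                            # axes, labels, and breaks only

# One more layer, joined with +, draws the points
scatter <- canvas + geom_point()
scatter
```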
You'll notice that it then plotted all of those points. Just for the sake of argument, let's comment out line 47, and you'll see why I filtered my data frame. That right there is Jabba the Hutt, who is a rather massive character, and he makes it look like my data doesn't have any pattern to it. I didn't want it to be that way, and I don't really feel bad about dumping Jabba the Hutt out of my dataset, because these are all fictional characters. In real life maybe that's not the best thing to do, and maybe it is; it depends on the data story you want to tell and on how you want to handle outliers. But that's a real-world example of how you can generate a plot like that. Before we get into more about plotting, which we will do in just a minute, where we'll talk about how you identify these other geom functions and how you manipulate them, and how you can alter labels, add titles, use color, things of that nature, I want to quickly cover a very simple tool for exploratory data analysis. Yesterday we leveraged functions in dplyr like group_by and summarize, which we could use to get column totals. We didn't cover it too deeply, but you can also get means, minimums, maximums, and things like that. We may want to revisit that. But another way to get the general scope of your data is to use a function called skim, which comes from the skimr library. We could skim gapminder, or for that matter starwars, since we're talking about Star Wars. Let's just do starwars first. If I press Ctrl+Enter on starwars, line 57, it tells me some things that I could figure out in other ways, but it's nicely and conveniently presented here: 87 rows, 14 columns; eight of the columns are character columns, three of them are numeric, and three of them are lists.
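The skim step can be sketched in a couple of lines, assuming the skimr package is installed. The name `sw_summary` is made up here; the workshop script may simply call `skim(starwars)` without saving the result.

```r
library(dplyr)    # provides the starwars data frame
library(skimr)

# One summary row per column, grouped by data type when printed
sw_summary <- skim(starwars)
sw_summary
```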
Then, using those data types, it tells me a little bit more about the vectors that share those common data types, like character: these are the names, how many values are missing, what the minimum and maximum string lengths are, how many unique values there are, whether or not there's whitespace, et cetera. These are the lists, and that's not super helpful to me right now, but it tells me the length of the lists. What I like about the numeric representation in skim is that, aside from giving me the standard deviation, the mean, and the break points for the quantiles, it also gives me this nice little spark graph, which is a graph that I could generate in ggplot; it could be a histogram. Just to leverage this a little bit: suppose I didn't actually want to make a scatter plot, but just wanted a histogram of the values of height. I should be able to do that, so let's comment this out to make it clear. It didn't like the fact that I had mass in there. Oh, what did I do wrong? Height not found. Why is height not found? Error in, blah blah blah, geom_bar, object height not found. Oh, I know why: because I don't have aes. I have to put it in aes. So there's a histogram. I can do it that way, and I can alter the size of the bins, that kind of thing. But if you have a lot of data and you don't want to write all of that code just to figure out simple distributions of multiple variables, you can get that with skimr. I can see that the height distribution is sort of normal, and I can see that birth year and mass are essentially right-skewed. What I would like to do next is dig more into ggplot by going to 02_a_vis_answers. I like to use the answers file because it means I can type less.
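For reference, a histogram of height along the lines attempted above might look like the following sketch. The variable has to be wrapped in `aes()`, which was the cause of the "object height not found" error; the bin width of 10 is an arbitrary value chosen here, not one from the workshop.

```r
library(dplyr)
library(ggplot2)

# A histogram needs only an x aesthetic; the counts are computed by the layer
height_hist <- starwars %>%
  ggplot(aes(x = height)) +
  geom_histogram(binwidth = 10)   # binwidth is an illustrative choice

height_hist
```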
I also want to give people a chance, especially people who did any advance work and know what questions they have about the material, to ask them. So I'm going to stop talking for just a little bit and give people a chance to unmute and ask questions. I see there's a question in the chat; you can ask it that way as well. Let me see what it says: is the x-axis the same for all histograms generated using skim? I think the answer to that is yes. The x-axis is going to be whatever numerical variables you have in your data frame. That was one I did for starwars, but let's go ahead and do it for gapminder, which I think is mostly numeric data. You can see here that I've got a U-shaped curve for year, a slightly left-skewed shape for life expectancy, and right-skewed shapes for pop and gdpPercap. So I believe the answer to that is yes. I'm not sure I completely understand the question, so if I haven't answered it, please feel free to redirect me. In the meantime, we'll also invite... "Sorry, I mean is the range of the x-axis the same?" So I guess you're asking whether you can alter the bin width of this, because the range is always going to be the full range of the data, as far as I know. Or is it specific to the variable? It is specific to the variable. This little histogram is specific to year, and then it will change for the next variable. If you want to generate a different, more sophisticated histogram, you would definitely move into ggplot for something like that, and we'll get there. I don't think I have a histogram example per se, because they're not too hard to generate, but as we go on, if you want to direct me to one, that'd be great. So, other questions: please feel free to throw them into the chat or unmute.
Meanwhile, I'm going to get set up to go to this other code file, which is 02 visualization, 02_a_vis_answers. We're going to use three ggplot onboard data frames for this part of the explanation. mpg is about cars and their miles per gallon, midwest is some population data about some states in the Midwest, and economics_long is some economic data. If I shift back to my four-quadrant view, let me just note that, for example, with mpg, if I highlight that and hit F1 on my keyboard, then in the Help tab I can read some information about the ggplot2 mpg data frame. In this case it operates a little bit like a codebook. That won't work for every dataset that you ingest; it works here because the package has that documentation built in. You can see, for example, that there's a variable called cyl, which is the number of cylinders of the car, that kind of thing. The other thing I want to introduce, which I haven't done before, is that there's additional documentation at ggplot2.tidyverse.org. Let me put that in the chat so you don't have to type it in. It's the same documentation as what you get onboard. It includes a link to a cheat sheet. What I like to use it for, however, aside from the fact that it really explains things and there are some nice articles about how to use ggplot2 in different ways, is the Reference link. The reference is broken down into all of the different elements of the grammar of graphics, including the layers, all of those geom functions. So if I want to make a 2D bin graph, that would be geom_bin2d. There's information about bar charts.
So there's geom_bar and geom_col. Line charts have got to be in there somewhere. It's generally alphabetical, but I don't see it, so I'm just going to do a free-text search for geom_line. There it is, right next to geom_path. All of these are layers that you can generate, and they will have a lot of similar features, but each of them will have some unique features. For example, let's find geom_histogram; it's right there. That gives me the arguments for how I can operate the different layers that are similar. In this case there's geom_freqpoly, geom_histogram, and stat_bin. But this is the one I'm interested in. You can see the arguments that we've sort of already talked about, like mapping and data. The aes is common to all of them, but then there are a couple of other things. Let's scroll down here. Here's more information about the arguments. The other thing that's really useful here is the aesthetics. Oh, it's not as clear in this one, so I'm going to back out of it. Let's go to geom_point, which is right here, and scroll down to where it says Aesthetics. These aesthetics are what goes inside of the aes argument. A lot of these are common to almost all of the functions, like mapping the x, mapping the y, changing the color (the interior color or the outline color, that's color and fill), changing the opacity, meaning how transparent it is, or changing the shape and size. If you need to know what aesthetic arguments are available for which function, you definitely need to go into the documentation, either online or onboard in RStudio. And again, just to recap, all of those layers are listed there in the documentation. So, what have we done so far? We have introduced mpg, which looks like this. Let me make it bigger.
What I want to do, and I've got some setup here, is make a scatterplot using displacement. Let me scroll to the right. Where's my displacement? Sorry, I didn't need to scroll because it's right there: displ, which is the size of the cylinder, in other words how much fuel it consumes when the cylinder is fired. So we're going to make a scatterplot with displacement as the x variable and highway mileage as the response variable, or the y variable. And highway mileage, where is it, is right here: hwy. The setup for that is pretty straightforward: mpg, and then ggplot, and inside of aes, x equals displ, y equals hwy. That will give me a basic canvas. And then, since I want a scatterplot, I'm just going to switch my conjunction from the pipe to a plus and write geom_point, and I get a very clear pattern there. Something over here starts to go wonky, but otherwise it's a pretty clear pattern. Now let's add some color. The question says color by class of vehicle, so let's take a look back at the data frame and find class. The class vector is right here; we can see that we have compact cars, midsize cars, SUVs, two-seaters, things of that nature. So all I've got to do is add more aesthetic arguments, in this case into geom_point. I think I have one too many close parentheses there. Now you'll see that I've got aesthetic arguments mapping x and y up here in the general ggplot call, so that means that x and y, displacement and highway, are available to all of my layers. And then, specific to the geom_point layer, I'm mapping the color aesthetic to class. That's going to allow the plot to get drawn. ggplot is going to draw a legend for me and pick the colors for me, and I can change them if I want to. But it will show me a nicer view where I can get a better sense of how these different classes of vehicles fit into my scatterplot.
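A minimal sketch of the scatterplot just described, assuming only that ggplot2 is installed. The object name p_scatter is an illustration and not something from the workshop file, and the data frame is passed straight to ggplot() rather than piped in.

```r
library(ggplot2)

# x and y are mapped inside ggplot(), so every layer inherits them.
# color is mapped only inside geom_point(), so it applies to that layer alone.
p_scatter <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class))

# Printing the object draws the plot, including the automatic legend.
p_scatter
```

Moving `color = class` up into the `ggplot()` call would make the color mapping available to any layer added later, which matters once a second layer is stacked on top.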
Right, and then I can start to see over here where this pattern doesn't hold so well. What is going on here? Well, it turns out it's all one class of car: they're two-seater cars, so that means sports cars. Then I can start to understand that there are cars that consume a lot of fuel in their displacement that still get higher gas mileage, and a reasonable explanation for that is that a two-seater car is probably a sports car. It's probably pretty light, much lighter than, for example, these SUVs that might be consuming the same amount of fuel in displacement, or slightly less, but getting considerably lower gas mileage. Anyway, that starts to become clearer. And I did that with color. Note that I could change this argument to fill, and it's going to do something slightly... oops, I didn't mean to do all that. Give me a second here. It didn't work because I forgot I needed a different shape. This isn't documented so well, but I'm going to set shape equals 21, I think; there's no reason why you would specifically know that. I'm just messing something up here by trying to get too advanced and going off script. There we go. You'll notice here that fill, if I have the right kind of shape (and that will all be in the documentation), is just the interior color, whereas color could then be something different, like, let's call it yellow. That helps me bring out one other aspect that's important. Oh, that doesn't show up very well. Let's call it green. That doesn't show up too well either. More commonly I would use black, actually. But one of the things that you can tease out of this part of the discussion is that by aesthetically mapping things, I'm mapping to vectors, but I can also set these arguments specifically to something, so that it has nothing to do with the variability of the data; it's just that I specifically want something to happen with the graph. And you can do both of those together.
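The difference between mapping an aesthetic and setting one can be sketched like this. It is an illustration under the assumption that shape 21 (a filled circle with an outline) is the shape used in the demo; the object name p_fill and the size value are made up for the example.

```r
library(ggplot2)

p_fill <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(
    aes(fill = class),  # mapped: interior color varies with the data
    shape = 21,         # set: shapes 21 to 25 have both a fill and an outline
    color = "black",    # set: outline color, the same for every point
    size = 3            # set: illustrative value, not from the workshop
  )

p_fill
```

Anything inside `aes()` is tied to a column of the data frame; anything outside it is a fixed value for the whole layer.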
So the last part of this question is: add a regression line for each type of car. There's a function called geom_smooth, which has the ability to... well, let me not uncomment that. Let's go ahead and run this whole thing as is, and you'll see that by default geom_smooth is going to add in a gray band for the confidence interval. And in this case it's making its best guess as to what kind of smoothing of the line you want. I know that from experience, although I can also look in the console: the message says which formula it's using, and it's reminding me that the formula for that regression (and we're going to talk more about regression towards the end of today) is y predicted from x. Now, it got y and x from here, right, this is x equals and y equals. If I, for whatever reason, wanted to alter that so that it was just a straight-up linear regression, I can set method equals lm. And I could also turn off the confidence interval: se equals FALSE. And there I have a linear regression line specific to the data. But you'll notice in line 39 that I can also draw regression lines that are specific to each class of car. So it's just interesting to know that you can do these things. Right now I have a loess line for each type. I don't particularly like that all that well; I think I might actually make that method equals lm, but it depends on what story you want to tell. And you get a better sense of how you can start to manipulate the grammar in your multiple layers. In this case it's a two-layer graph: a scatterplot and, on top of the scatterplot, a regression line. Okay. Well, now that we know how to make a scatterplot, let's use that information to make a bar plot. Let's look at our midwest data real quickly. It is, among other things, total population, population density, and population by various demographic groups.
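Going back to the regression-line layer for a moment, here is a hedged sketch of the two variants discussed: one straight line for all cars, and one line per class. The object names are illustrative, and the second plot gets its per-class lines by moving the color mapping into ggplot() so that both layers inherit it, which may differ from how the workshop file does it.

```r
library(ggplot2)

# One linear fit across all cars: color is mapped only in geom_point(),
# so geom_smooth() sees a single group.
p_lm <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(method = "lm", se = FALSE)  # se = FALSE hides the gray band

# One linear fit per class: color is mapped in ggplot(), so the smoother
# is computed separately for each class.
p_lm_by_class <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

p_lm
p_lm_by_class
```

Dropping the `method` argument lets `geom_smooth()` pick a smoother itself, which for a data set this small is the loess curve mentioned in the talk.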
And if we go all the way over, there's a category variable here. I'm not sure what these categories are, but it's a categorical variable, even if it's not a categorical data type. There are reasons to convert data into the factor data type, but you don't always need to; sometimes you can leverage the fact that the data is categorical even without making it a factor. And not everybody uses factors, so I'm not going to dig into it right now. So I take midwest and I just plot category, that's this far variable, by making it essentially the x axis of geom_bar. Notice there are 437 rows in this data. What geom_bar is going to do is count up and give me a total for each one of these distinct categories. So in that way it's not unlike what we learned yesterday with count, counting category. It's going to give me a bar height for the total of each one of the distinct categories. So if I comment that out and run that plot, I have a bar plot that shows me that the largest category, or the most frequent category, is AAR, and then there are several, like HHR and HHU and AHU, that are probably the least frequent. From a visualization perspective, generally speaking, I would not stop here, because I would prefer to sort my bars. And this is where a different tidyverse function comes in really handy: one that enables me to explicitly make the character vector, the variable called category, a factor. Let's just review: this vector is a character vector, but if I wanted to explicitly make it a factor data type, I could use one of the functions from forcats, which is a tidyverse package. And specifically the one I want to use is fct_infreq, which will order the levels by number of observations, from largest to smallest. So I'm just going to manipulate that variable as a factor. And I need to turn this off. And then I have a sorted bar chart.
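The sorted bar chart can be sketched as follows, assuming ggplot2 and forcats are installed. The names p_bar and category_order are illustrative only.

```r
library(ggplot2)
library(forcats)

# fct_infreq() converts the character vector to a factor whose levels run
# from most frequent to least frequent, so geom_bar() draws sorted bars.
p_bar <- ggplot(midwest, aes(x = fct_infreq(category))) +
  geom_bar()  # geom_bar() counts the rows in each category itself

p_bar

# The level order that the bars follow
category_order <- levels(fct_infreq(midwest$category))
category_order
```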
Which is useful, because it's not always useful to sort a bar chart alphabetically, although that is the default that we saw before. It's really more useful to look at it this way, so I can get a good sense of all of these values in order. Note, however, that there's this really nice feature of ggplot, which is particularly useful for bar charts that have longer labels (and most of them have longer labels): you can alter the axes by saying flip. This is coord_flip, coordinate flip. Notice what happens here: I've swapped my x and y axes. I have this slight inconvenience that now the order of my bars is not quite how I wanted it to be, so I could add one more function, fct_rev, and display my bars in a way that's more in line with best practices of visualization. All right. Now, there is a geom_bar function right here. There's also a geom_col function. Sometimes you have data like what we have here, where the counting still has to be done. Let's just look at the midwest data. What we did there is we had ggplot count each one of these AARs, and count each one of the LARs and the LHRs. But sometimes you already have totals. So for example, if we apply our functions to get population totals by state, geom_bar is not going to work, because it doesn't know what to do with this information. geom_bar only takes one mapping, the x axis (this is the same as saying x equals), and it calculates the counts by itself, whereas geom_col needs an x and a y. So in this case the x axis is state and the y axis is state_pop. And then geom_col will work just fine and we will get a bar chart of those values. And I would probably be inclined to order this: fct_reorder, state by state_pop.
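A sketch of both ideas from this stretch: flipping a counted bar chart, and using geom_col when the totals are already computed. The summary step is an assumption about how the state totals were produced (the talk only says the population was totaled by state); state_pop is the column name used in the talk, and the plot object names are illustrative.

```r
library(ggplot2)
library(dplyr)
library(forcats)

# Counted bars, flipped so labels read horizontally; fct_rev() puts the
# most frequent category at the top after the flip.
p_flipped <- ggplot(midwest, aes(x = fct_rev(fct_infreq(category)))) +
  geom_bar() +
  coord_flip()

# Totals computed ahead of time: one row per state.
state_totals <- midwest %>%
  group_by(state) %>%
  summarise(state_pop = sum(poptotal))

# geom_col() needs both x and y because it plots the values as given.
p_col <- ggplot(state_totals,
                aes(x = fct_reorder(state, state_pop), y = state_pop)) +
  geom_col()

p_flipped
p_col
```

`fct_reorder()` sorts from smallest to largest; wrapping it in `fct_rev()` reverses that order.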
And unfortunately that's not in the order I want, but of course, remember, I could use fct_rev. I did something wrong, spelled something wrong; still spelled it wrong; can't spell. There we go. And then there are some things I don't like about this chart: it doesn't have a title, it doesn't have very pretty axis markings here, and I don't particularly like the x and y labels, all of which I can change. Let's see what happens here when we run. Yeah, I'll just go ahead and do it up here. The first thing I want to do is add some labels. So I'm going to use the labs function. I'm going to say title equals State population. I could add a subtitle if I wanted to, with subtitle equals. I can change my x axis label to State; currently it's an auto-generated label. And I probably don't need a y axis label; if I don't want it at all I could just make y equal to nothing, or I could put in the word Population. I could also put in a little caption about my data source: Source: ggplot2::midwest. Let's see how far we get with that. So there I've improved some of the labeling. I'm still not crazy about this axis. There are a couple of different ways I can handle that. I could go in here and make some change to the scale. So scale_y_continuous: in this case it's the y axis, and it's a continuous variable. All of this is in the documentation, but I just want to introduce it to you. I could say labels equals and then call a function, scales::comma, that will make it pretty. That's pretty, but those zeros are a little superfluous. I could do something completely different. Let's comment that out and do it a different way, where I set breaks equal to a range, let's say 5 million and 10 million. And that gives me just two breaks, still in scientific notation, which may be what I want, or maybe I want to make it really pretty.
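The labeling and scale changes can be sketched like this. It is a self-contained illustration that rebuilds the state totals (the summary step is an assumption), and the short labels 5M and 10M are one way of "making it really pretty", set by hand to match the two breaks.

```r
library(ggplot2)
library(dplyr)
library(forcats)

state_totals <- midwest %>%
  group_by(state) %>%
  summarise(state_pop = sum(poptotal))

base_plot <- ggplot(state_totals,
                    aes(x = fct_rev(fct_reorder(state, state_pop)),
                        y = state_pop)) +
  geom_col() +
  labs(title = "State population",
       x = "State",
       y = "Population",
       caption = "Source: ggplot2::midwest")

# Option 1: keep the default breaks, format the numbers with commas
p_comma <- base_plot + scale_y_continuous(labels = scales::comma)

# Option 2: choose the breaks by hand and give each one a short label
p_short <- base_plot +
  scale_y_continuous(breaks = c(5e6, 10e6), labels = c("5M", "10M"))

p_comma
p_short
```

The `labels` vector in option 2 has to be the same length as `breaks`, one label per break.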
So then, for the number of breaks I have, I might say labels of 5M and 10M. And in that way I can start to alter things: not only am I altering the labels with scale arguments, but remember, way up there I was introducing color. I can also use the scale functions to alter the colors that are chosen. Where is it? Hold on, let me grab this. There's this little guide that I worked up; let me paste that into the chat. A lot of that stuff about scales and color is covered in here, about how you can use different kinds of functions and things like that. It's sort of a cheat sheet, a quick reference guide. But if I go to scales, you'll see that I'm using different colors. ggplot will choose colors for you if you don't choose any. And then you can use scales to change your colors to different pre-chosen categorical or continuous color ramps, or you can even set them manually, depending on what you want to do. That can be a little tricky to work with at first. I'm not necessarily going to cover that right now; I'm happy to, if you want, but I'm going to let the group direct that if that's of interest today. Another thing to mention: we've now done some scatterplots and histograms and box plots, all with the same grammar. Let's add a line plot, because one of the things that ggplot will do very nicely for you is make a time series graph. So if we look at economics_long... actually, before we do that, let's look at economics, so I can mention to you the difference between long and wide data. This is wide data, which is not particularly tidy, as we were talking about yesterday: each row has an observation for pce, an observation for pop, an observation for psavert, unemployment, whatever. There are 574 rows. But tidy data would be long. There would be some redundancy in the data, but the pce observations are going to be there, and if I scroll through...
I'm not sure how far I have to scroll to get to a different variable. There: I have pop, things of that nature. So my point is that when you transform your data to long, with ggplot a lot of times it's easier to iterate over that data to generate the graph that you want. We'll talk about pivoting later, but we'll start out with economics_long, and we're going to set our x variable to date, which you'll notice is a specific data type called a date. And if it is a date, then ggplot will do smart things with that date in terms of how it displays it and how it chooses labels. So x equals date, which is common in a time series. And y in this case is going to equal value, or it could be value01; value01 is just the values in the value column rescaled between zero and one. And then I'm going to add geom_line and map color to the column called variable, which holds the distinct category names. Right. And I get this graph, which we looked at a little bit yesterday based on a question. This one variable here called pop is very much out of scale with the other numbers, so it's hard to tell what's going on here. But if I change this to value01, which was the scaled value, you can get a better sense of how all of those variables move over time. Everything is scaled between zero and one. But notice on the x axis: ggplot chose to label the x axis by decade, even though, if you look at the data, it's not displayed that way in the data frame. It's year-month-day, but ggplot has done some calculations and said, well, you don't really need to know all that stuff based on this data frame. And something that's particularly challenging to work with in data in general is date-type data and time data. There's a nice library to help you work with that called lubridate. Let me just write that out so you can see it: library(lubridate).
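The time series plot described here can be sketched as follows, assuming only ggplot2. The object name p_ts is illustrative.

```r
library(ggplot2)

# economics_long is already in long form: one row per date and variable.
# date is a Date column, so ggplot2 chooses sensible axis breaks by itself.
p_ts <- ggplot(economics_long, aes(x = date, y = value01)) +
  geom_line(aes(color = variable))  # one line per distinct variable

p_ts

class(economics_long$date)
```

Swapping `value01` for `value` reproduces the first version of the graph, where pop dwarfs the other series.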
And that will help you manipulate date information. But we don't actually need it for what we're doing today. Another thing to point out is that, well, that number is scaled; it doesn't tell you much about the actual values of these different variables. So a different way to do that, and we talked about this before, is to use facet_wrap. So we're going back to value instead of value01 (value01 is the scaled one), and we can display all this together. And, you know, some things get muted because they're out of scale. So you can add a function called facet_wrap and create a faceted graph by each of the variables listed there. And they're still a little muted, but you can see them a little bit differently. You also have these nice features where you can free up the scales, so I can type scales equals free_y; all this is in the documentation. I get a separate y scale for each one of these facets, but notice that it is still using a common x scale where it can. I could also set nrow, I think, nrow equals one. In that case it's repeating the x axis every time. Or ncol equals one, which wouldn't make a very nice graph, but you can see that it has that common y scale. The other thing I could do, ignoring all that... well, let's just do nrow equals one and I'll show you another feature. This looks a little squished. This legend is not necessary, because I can see that this is unemploy and psavert and pce, so I might turn that legend off right here: show.legend equals FALSE. And I might try to make it even wider by clicking on this little gear over here, which will set some options for this code chunk. I'll toggle this thing that says use custom figure size and change the width to, let's say, 12, and click Apply. And what happens is it wrote this in up here, in the chunk header. Now if I run this one more time, I've got a much wider plot.
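A sketch of the faceted version, assuming ggplot2. The object name is illustrative, and the chunk-header line in the comment shows the kind of option the gear menu writes for you; the exact numbers are whatever you choose.

```r
library(ggplot2)

# In an R Markdown chunk, the figure size lives in the chunk header, e.g.
#   {r, fig.width = 12}
# That setting only affects knitted or inline output, not the plot object.

p_facets <- ggplot(economics_long, aes(x = date, y = value)) +
  geom_line(aes(color = variable), show.legend = FALSE) +  # legend is redundant
  facet_wrap(~ variable, scales = "free_y", nrow = 1)      # own y scale per panel

p_facets
```

With `scales = "free_y"` each panel gets a y axis fitted to its own variable, while the date axis stays shared.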
Actually, it feels like it's a little too long. Let's make it nine by six and see what happens there. It's a little better, still big. How about nine by three? There we go. You have all of those features to manipulate your graphic, and then in the end, if you want to save it, you can use this function called ggsave and give it a name. We're going to call it example.png. And I'm going to zoom back out to the Files pane, and you'll see that it will show up right there in the root of my RStudio project. It's right there, and then I can look at it, manipulate it, copy it, paste it, do whatever I want with it. Okay. What I want to do now is send you a little survey. Oops, I didn't mean to do that. It's okay, I'll get that straight. The survey will help us figure out what we're going to do next. It's going to ask you: do you want to talk about interactive visualizations, or do join and merge, or do pivot, or do regression? Just pick two from this survey. We'll try to cover them all. And there should also be questions; somebody asked, can we export the figures into MS Word, and can those be editable? So that's an interesting question. Let me click share again and go back to that screen where I was. So this is just a PNG file, because that's what I called it, but I can make it a JPG file, whatever. I can also change the height and width here, but I'm not going to do that. And you can move that into Word as you see fit. Or, picking up from yesterday, I could just knit to a Word file, and it would exist in the Word file. Line 60. That's interesting. Let's try this again: Knit, Knit to Word. And the file that it's going to generate on the fly is still going to require some more formatting, but... oh, and the execution still halted. I wonder what I did wrong.
Here: functions that produce HTML output found in document targeting... oh, that's interesting. I'm doing too much of something here; I'm not sure what, and I don't really want to troubleshoot it right now, but you can generate the Word file from here. And you can also open these files in additional editing tools and do more things with them. For example, I did that a minute ago: I just clicked on it and it threw me into a Microsoft Word editor, which allows me to do more. I don't know how this tool works, because I have never used it, but yeah, I can do more stuff and save it and do whatever I want with it. Charlie asks, what is the command to change the background color? It's a good question, Charlie. Let's run this one again. There are themes that you can use, so for example I can say theme_classic, and it will change that to white. There's a whole bunch of themes and you can see them there: gray, light, dark, linedraw. You can also do void. And then you can really get into it: you can set arguments inside of theme and generate whatever color you want, depending on what you're doing. All right, let's have a look at the results of the survey. Let's take another question here. If you haven't filled out the survey, please do. It says: in Stata there's a feature that allows you to save already configured graphs as graph_name.gph, to then be able to edit them again. It's useful, as you can edit graphs directly without having to recode it all. Right, so you can do the same thing here. Let's take a look at this. Let's assign this to, call it my_plot. Notice that if I just run lines 63 and 64, I don't get any output, and I don't understand why. Oh: all I've done is assign that graph to an object called my_plot. If I then call my_plot, I get that graph, right.
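Assigning a plot to an object, restyling it, and saving it can be sketched like this. The scatterplot is only a stand-in for whatever plot was on screen, the background fill color is an arbitrary example, and the file is written to a temporary directory here rather than to the project root.

```r
library(ggplot2)

# Assignment alone draws nothing; the plot lives in the object.
my_plot <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

my_plot                    # printing the object draws it
my_plot + theme_classic()  # a complete theme with a white background
my_plot + theme_dark()     # the same object, restyled again

# Arguments inside theme() change individual elements, such as the
# panel background color.
my_plot + theme(panel.background = element_rect(fill = "lightyellow"))

# ggsave() writes a plot to disk; the extension picks the file type.
out_file <- file.path(tempdir(), "example.png")
ggsave(out_file, plot = my_plot, width = 9, height = 3)
```

Because `my_plot` is an ordinary R object, it can be saved with `saveRDS()` and reloaded later, which is roughly the role the .gph file plays in Stata.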
And then, if I want to, I can do more ggplot with that graph, like give it theme_dark, and it changes again. I could add to it; yeah, you can do more to it. So you don't actually so much save it as assign it to an object that you can continue to manipulate. All right. Let's scroll down to the bottom. All I have to do is load this library called plotly. And just like I did a moment ago, where I generated a ggplot object, let's just run these lines. That generates for me a bar plot. It's a stacked bar plot, filled in with color by state, and it is a bar plot of categories, whatever these categories are. And I want to make that bar plot interactive. The easiest way to do it is to use the plotly function called ggplotly. Now, there are other ways to make it interactive: there's a whole range of these things called htmlwidgets, and if you Google the phrase htmlwidgets you'll run across this gallery. You can do any number of things, but we only have time to cover this quickly. And when I run that, I'm in a whole different mode where I have some interactivity: I can zoom in on things, I can get fly-out windows, and it's still drawing the legend. I can turn things on and off, and with this toolbar I can take a picture and download that as a PNG. It's a really nice feature, and you can then integrate that into HTML reports or dashboards or web pages. Another layer of interactivity would be to make these visualizations using an R tool called Shiny, which is more advanced than we're going to have time to do. Somewhere on the R fun site there's an introduction to Shiny that I would encourage you to look at if you're interested. Okay. Joins and merges: let's do that next. Files, pivots and joins. Okay, so for this one, we're going to open the join answers file. We're still going to use the tidyverse packages.
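Before moving on to joins, the interactive step from a moment ago can be sketched as follows. It assumes the plotly package is installed; the object names are illustrative, and the stacked bar plot is a reconstruction from the description (categories on x, filled by state).

```r
library(ggplot2)
library(plotly)

# A static stacked bar plot: one bar per category, filled by state
p_stacked <- ggplot(midwest, aes(x = category, fill = state)) +
  geom_bar()

# ggplotly() converts the ggplot object into an interactive htmlwidget
# with zooming, hover tooltips, and a toolbar for downloading a PNG.
p_interactive <- ggplotly(p_stacked)

p_interactive
```

The result renders in the RStudio Viewer pane or in any HTML output, but not in a Word or PDF document.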
And we're going to read in some data, some on-board data that is in this data folder right here, called 538 favorability ratings. We're skipping 11 lines that hold the provenance of where that data came from: it came from the FiveThirtyEight.com data journalism GitHub site. I manipulated it some, but basically they did a survey where they asked people who their favorite Star Wars characters are. And so you can see that you've got, for example, Han Solo, who has a rating of 610, so he's in this case a very favorite character. And then maybe look down at Lando Calrissian, Boba Fett; they're not so loved. 110 for Emperor Palpatine: he's a big villain, right, people don't like him. And historically people have not liked this character Jar Jar Binks. So you get a sense of what that's like. What we want to do is join that data with a different data frame; we already have the starwars data. If we look at these two together, one of them is a 14-row, two-column data frame, and one of them is an 87-row, 14-column data frame, and what we want is to add those two data frames together. So basically you need some kind of join key, some common variable across both data tables. Just so you know, I could add these data tables together with something like bind_rows, I think, of fav_ratings and starwars. And the problem is that they mostly don't have common variables. So I just end up with really a bunch of garbage, in a way: the fav_rating column only goes on for 14 rows, and some characters are repeated. So you can do that if your data is already well formed and good to add together. But what I want to do is make a join that makes a little bit more sense. There are different kinds of joins I can do. In the tidyverse, the one join that most people do is called a left join. I could describe these joins to you, but it quickly turns into word salad.
It's a little bit easier with Venn diagrams. If I have a left data table and a right data table, anywhere there's a commonality in that join key that I was just mentioning, a left join will bring over the data from the right data table. You can do the opposite of that, and you can find only the intersection, and then, for example, we'll use an anti join, where we want to find out what exists in table x that doesn't exist in table y, and vice versa. So we're going to use name as the join key. And let me just say in advance that, from a data manipulation perspective, this is a bad plan. Ideally, when you do joins, you want to join on unique, unambiguous keys, which sort of explains why we all have a driver's license ID number and, for example, a Duke University employee ID or student ID number: we have something that's unique to us, so it's easy for a computer to go, oh, this is just very clear, it's your number. I like to say, for example, in the Star Wars universe Luke Skywalker's employee ID is 001 and Emperor Palpatine's employee ID is 666. It's easy for the computer to figure out those things and match them up where appropriate. But we don't have an employee ID. And so character data is going to be more fuzzy, and name data even more fuzzy, because of the alternate ways that you can spell things. You may not know how to spell them. You may capitalize things or not capitalize things. In any case, it will make a good example. We're going to use the name variable as our join key because it's common across both tables. All right, so the way you would write that is we take the first table, which is going to be our left table, and left join it to our right table. And you'll see that I have line 44 commented out. The long construction of this is to say by, and then say what the join keys are; in this case they're both called name.
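A small sketch of that left join, using a four-row stand-in for the ratings table because the workshop's CSV is not available here. The names fav_ratings and fav_rating are assumed spellings of what was said aloud; the two ratings shown are the ones quoted in the talk, and the others are left as NA rather than invented.

```r
library(dplyr)   # also provides the starwars data frame
library(tibble)

# Toy stand-in for the favorability ratings table (hypothetical object)
fav_ratings <- tribble(
  ~name,                  ~fav_rating,
  "Han Solo",             610,
  "Emperor Palpatine",    110,
  "Princess Leia Organa", NA,   # rating not quoted in the talk
  "C-3P0",                NA    # rating not quoted in the talk
)

# Long form: name the join key explicitly
joined <- left_join(fav_ratings, starwars, by = "name")

# If the key had different names in the two tables, by takes a named
# vector, e.g. by = c("name" = "last_name")  (last_name is hypothetical)

joined %>% select(name, fav_rating, height, homeworld)
```

Every row of the left table is kept; columns from the right table are filled with NA wherever the name has no exact match.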
So I don't really have to write that out, but I might have name in one table and last_name in another table, and so it gives me that flexibility. But I can write it without doing that. I can do a left join of the favorability ratings to starwars. And when I do that, it now gives me many columns. It was a 14-by-two data frame, fav_ratings, and then it added 14 more columns, or 13, because name is common to the two. So now it's 14 rows by 15 columns. You'll notice there are a lot of NAs. These came over, I think, from the original... no, they didn't. Sorry, those NAs are where there was no match; there was no match from the fav_ratings Princess Leia Organa to starwars. Now, you'll instantly notice this doesn't really make a lot of sense, because she's a main character, and the other table is bigger. So the join key was ambiguous, and I'll show you why in a minute. Let's go ahead and arrange that, sorted, so that you can see what we've got here; we're sorting by fav_rating. And again Han Solo shows up as the most liked character. I'm going to use anti_join in this case, just to figure out what's not matching. In my first case I'm saying, show me what exists in fav_ratings that doesn't exist in starwars. And then in my second anti join, show me what exists in starwars that doesn't exist in fav_ratings. And it's interesting, because there are 79 characters that don't match here, a lot of which are not in the first table, like BB8; I don't even know what BB8 is. I'm not familiar with a lot of these characters, but I know the main ones. So if I go back here to the first one, it just makes no sense to me to see C-3P0, Emperor Palpatine, there's Princess Leia. Why are those not matching? And so I'm going to do a little bit of magic here. It's not really magic, but advanced regular expressions to find text patterns in the other table.
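The two anti joins, and a simple pattern search for near matches, can be sketched like this. The four names are a toy stand-in for the ratings table, and the regular expression is a plain illustration rather than the one used in the workshop file.

```r
library(dplyr)   # also provides the starwars data frame
library(tibble)
library(stringr)

# Toy stand-in: just the name column of the ratings table
fav_names <- tibble(
  name = c("Han Solo", "Princess Leia Organa", "C-3P0", "Emperor Palpatine")
)

# What is in the ratings table but has no exact match in starwars?
unmatched <- anti_join(fav_names, starwars, by = "name")

# And the reverse: starwars characters with no rating
not_rated <- anti_join(starwars, fav_names, by = "name")

# Look for likely counterparts in the other table with a text pattern
candidates <- starwars %>%
  filter(str_detect(name, "Leia|3P|Palpatine")) %>%
  select(name)

unmatched
candidates
```

Comparing `unmatched` with `candidates` side by side shows the spelling differences that break the join.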
We're not covering regular expressions today, but they come in really handy. But let's just look at this. C-3P0 does not match C-3PO, because computers are very literal: zero and the letter O are not the same thing as far as a computer is concerned. That's not a match. That goes to what I was talking about: matching works best with IDs that are clear and unambiguous. Princess Leia Organa does not match Leia Organa. You can do things to manipulate your match key when it's character data: you can make everything lowercase, take out the diacritical marks, take out the spaces. You're still going to end up with not as clear a match as if you had a crisp, unambiguous ID to match on, but you can improve your chances. Pivoting. So we'll take economics here. This is some wide data. And let me just note that a lot of times in R, and in the tidyverse in particular, it's advantageous to have long data. So for example, instead of 574 rows by six columns, we would have 2,870 rows by four columns. That's advantageous because it's easier to iterate over those variables. But sometimes, depending on the function you're trying to use, it's better to have the wide data, so you can pivot in either direction with two functions: one called pivot_longer, and the corresponding one called pivot_wider. For people who've been using R for a while, these roughly correspond to two functions called gather and spread. And there are some other functions that are not coming to me right now. Here's economics_long arranged by date. It's the exact same data; it just turns out there's some redundancy in the data, but that's not so big of a deal anymore. Back in the 60s and 70s data storage was expensive. Now, you know, you can get a phone that has more disk storage or RAM than the computer that put a rocket on the moon, or a lunar lander on the moon. So a lot of times you're better off having more data, redundant data, that has some sort of built-in semantic meaning.
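A minimal sketch of pivoting in both directions, assuming ggplot2 (for the economics data) and tidyr are installed. The object names econ_long and econ_wide are illustrative. Note that this result has three columns, whereas the built-in economics_long has a fourth, value01, holding the rescaled values.

```r
library(ggplot2)  # provides the economics data frame
library(tidyr)

# Wide to long: the five measure columns become name/value pairs
econ_long <- pivot_longer(
  economics,
  cols      = pce:unemploy,  # the columns to pivot
  names_to  = "variable",    # old column names go here
  values_to = "value"        # cell values go here
)

# Long back to wide: the inverse operation
econ_wide <- pivot_wider(
  econ_long,
  names_from  = variable,
  values_from = value
)

dim(economics)
dim(econ_long)
dim(econ_wide)
```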
And R enables you to deal with that. Here's the way pivot_longer works. Let's look at economics. If we wanted to get this view from this (and that's what's already been done, right), let's just take a look at economics. The arguments to pivot_longer are: what columns do you want to pivot? I'm saying pce through unemploy. So this and this and psavert, all the way through to unemploy. Those are the columns I want to pivot. And I want to move those column names into a column called variable (that's names_to: these names become a column), and then the values are going to go into a column called value (that's values_to). And I'm going to end up with this right here: here are my column names, pce, pop, psavert, unemploy, and here are the values of those observations. And by extension, this is now tall data. It's a little bit easier to iterate over and, among other things, to make graphs with a minimal amount of code. So we're going faster than I expected, but I guess I kind of stepped it up there, so let's go ahead and finish up and talk about regression. This is probably the more complicated part, so go ahead and unmute and ask a question if you want to. I will try to demystify things. I'm going to open up the 02_c regression answers file. In this case we're going to use a library called broom, which is part of the tidyverse. I'm using dplyr and ggplot2. And for concise code I should probably write it like this, because I don't need to write dplyr and tidyr and ggplot2 when the tidyverse is going to load those anyway. broom and moderndive actually sort of do the same thing; it's just that moderndive is really a package that leverages broom to teach statistics. As I'm not a statistician, it's been helpful to me to have run through the free ModernDive book, which by the way is available online and free; it's at moderndive.com.
But I'm only going to talk about the broom functions, although I use both in this code set. Just load those three libraries. So this is the construction for making a linear regression, right. So I have my starwars data, if I look at that. Notice that the argument order is pretty different; I can't use my pipes here. So data equals starwars. And the way you write a linear regression formula is you say mass is predicted from height, right, so the response variable is predicted from the explanatory variable. If I wanted additional variables, and it's not appropriate for this data set, I would use the plus construction, for example, plus birth_year. If I could recode eye_color into numeric categorical variables, I could say plus eye_color, but that's not going to work in this case because it's a character variable. At least I don't think it will work. So let's take that out and just keep it simple. I just want you to see that construction. That is the way you write a linear regression statement using the lm function. And I'm going to assign the output of that to an object called my_model. So, my_model: there are my coefficients, and it reminds me what the formula is. And a lot of people will then use the summary function on my_model, and they'll get a bit more information, evaluative statistics about the model itself, like the R squared and the p-value and the adjusted R squared. Note, however, that while it's great to be able to see that on the screen, it's hard to manipulate that. And that's where the broom package comes in. Right. So, with broom, I can use the tidy function on my_model, and the result of that will be a data frame. And so there are my coefficients, including my p-value information. Or I could use glance.
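A minimal sketch of the construction described above, assuming the starwars data frame from dplyr and the object name my_model used in the talk. The second model, with birth_year added, is only there to illustrate the plus construction; the names two_predictors, model_terms and model_stats are hypothetical.

```r
library(dplyr)   # provides the starwars data frame
library(broom)

# Response on the left of ~, explanatory variable on the right;
# lm() takes the data frame through its data argument
my_model <- lm(mass ~ height, data = starwars)

my_model            # prints the formula and the coefficients
summary(my_model)   # printed report: R squared, adjusted R squared, p-values

# broom returns the same information as data frames
model_terms <- tidy(my_model)    # one row per term, with estimate and p.value
model_stats <- glance(my_model)  # one row per model, with r.squared etc.

model_terms
model_stats

# Additional explanatory variables are added with +
two_predictors <- lm(mass ~ height + birth_year, data = starwars)
tidy(two_predictors)
```

Rows with missing mass or height are dropped by lm by default, so the model is fitted on fewer rows than starwars contains.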
With glance I get some additional evaluative information, like the adjusted R squared, and this information will change depending on the type of statistical model you're doing, right. So there's lm, there's also glm, and the list goes on and on; it depends on what you want to do. But with tidy and glance, for linear models and whatnot, the advantage is that it puts the output into a data frame. So for example, now I can use my dplyr functions to easily pull things out. Like I could say filter where term equals height, and then I might select estimate and p.value. See if that works. Yeah. And, you know, if for some reason I wanted to get it out of the data frame again, I can do that. So once you understand that grammar of dplyr, it becomes easier to manipulate the outputs of your model, if you use broom to tidy up those outputs. There's also a function called augment, which will give you the residuals of your model and the predicted values, .hat, .fitted, that kind of thing.
Let's look at how that plays out, right. So in the broom, I'm sorry, in the moderndive library, there's a data set called evals. And we're going to just pull out a couple of variables from evals and look at the smaller data frame of 463 observations, where there's an ID variable for the observations. Everybody gets a score. They all have their age recorded, and there's a subjective variable called beauty average. I don't exactly know the origin of this data, but essentially the hypothesis is going to be: do age and beauty average affect the score of these instructors? This comes from some evaluation of college instructors. So we're going to use that data. Using that same summary function, we could get some general data about these variables. Let's talk more about these variables. By the way, we could use what we've learned about ggplot, since they're all numeric variables. Let's see if this is going to work.
Actually, I would have to pivot it. I wonder if I can do this real quickly. Let's use pivot_longer, score through age, names_to variable. And I don't think I need... Oh, there we go. Now I've got long data. Send that to ggplot, and call my x variable and my y value. And geom_boxplot. Cross your fingers. So this is similar to what happened here. It gives me a visual representation of what I see in numbers here: the quartiles of the box plot, the first quartile, the third quartile, the median, the interquartile range. And if it's got outliers, like that one does, a representative dot. I can use skimr on that data. If you looked at the data, and we talked about this already, I can get a correlation of how score relates to beauty average. And in that case, it looks like that's not a particularly strong correlation. There are other ways to get correlations. For example, if I go back to my starwars data, the correlation of mass to height, leaving out, again, Jabba the Hutt, there's a much stronger correlation. I can visualize some of that stuff. So, making a scatterplot, here I'm using geom_jitter, which is a way to deal with overplotting. It's the same as geom_point, but it repels points that are sitting on top of each other so you can see clusters. And you can see that in this case score and age are not particularly well correlated, or don't appear to be, but you can see these jitter clusters here. The underlying data is not changed, but the visual representation is repelled just so that you can see where there are clusters of data. There's another get_correlation, of age to score. And here we're going to fit that model, right. So: how does score get predicted by beauty average, for the evals data? We're making that a linear model, and we're putting it into an object called score_model.
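The steps in this part of the session could be sketched roughly as below. This assumes the moderndive package is installed (it supplies evals and get_correlation); the object names evals_small, evals_long, box_plot and jitter_plot are hypothetical, and the exact column selection in the workshop script may differ.

```r
library(dplyr)
library(tidyr)
library(ggplot2)
library(moderndive)  # assumed installed; provides evals and get_correlation()

# Keep the four variables discussed in the talk
evals_small <- evals %>%
  select(ID, score, age, bty_avg)

# Pivot to long form so a single geom_boxplot() call covers every variable
evals_long <- evals_small %>%
  pivot_longer(cols = score:bty_avg,
               names_to = "variable",
               values_to = "value")

box_plot <- ggplot(evals_long, aes(x = variable, y = value)) +
  geom_boxplot()
box_plot

# Correlation of score with beauty average, then with age
evals_small %>% get_correlation(score ~ bty_avg)
evals_small %>% get_correlation(score ~ age)

# geom_jitter() adds a little random noise to each point's position,
# so points that overlap become visible; the data itself is unchanged
jitter_plot <- ggplot(evals_small, aes(x = age, y = score)) +
  geom_jitter()
jitter_plot

# Score predicted from beauty average
score_model <- lm(score ~ bty_avg, data = evals_small)
score_model
```

Because the three measures sit on different scales, the box plots share one y axis here; adding facet_wrap with free scales would be one way to compare them more comfortably.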
And if we use tidy on score_model, we get our coefficients in a table. We can use glance to get our evaluation information, including the R squared. And we can use augment. And once it's in those tables, of course, not only can we manipulate it with dplyr, but we could manipulate it with ggplot as well.
10 minutes to go. I know that was quick on regression, but that was the least voted on, so let me know. She tall says: in the interactive bar plot, is there a way to show all variables in the data set, or does it just show the ones we used for coding the plot? So it's just going to show that much. Let's go back to that one. So, remember that in this technique, what we did first is we just generated a bar plot. So putting interactive to the side for just a moment: we're only plotting one variable, category. And then we're using the variable state to make it a stacked plot. Not that this is important, but let me at least clarify that part, right: it's just a standard bar plot where we used a variable to make it stacked. So when you say using all variables, remember that midwest has got a lot of variables in it, but we're not visualizing all of those variables, and the interactivity is just based on the plot that we generated. So I think I'm a little confused on the question.
Yeah, so if I understand correctly, that means it is just going to show the variables that we used for the code. If there was an extra variable in that data set called age, is there no way for me to tell the plot to show me age also? I don't want to plot the age, but to see it in the interactive window that pops out. It is a lovely feature, I feel. So I was just wondering if we can just add, you know, a label or something extra there.
But hold on. There we go. So you can see here I added. So is this what you mean by adding a label? That's helpful. Yes, yes. So then, if I do that, let's see if I can.
See if I can send that to ggplotly, to see if that label comes through. Not everything does. Yeah, it did. So you can see what's happening there. So that's one way to add a label. Also, you can add labels another way. So now if I turn that into a bar plot. Yes. state, total pop, geom_col. So I've got my bar plot there, and then I can add a label this way: a label where aes is label equals total pop, and we can add the label.
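One way to get an extra variable into the interactive tooltip, sketched under some assumptions: this uses the text aesthetic together with the tooltip argument of ggplotly from the plotly package, which is a slightly different route from the label aesthetic used in the session. The summary table state_pop and its columns total_pop and n_counties are hypothetical names made up for the illustration.

```r
library(dplyr)
library(ggplot2)   # provides the midwest data frame
library(plotly)

# One row per state, with a plotted variable (total_pop) and an
# extra variable (n_counties) that is not drawn on either axis
state_pop <- midwest %>%
  group_by(state) %>%
  summarise(total_pop = sum(poptotal),
            n_counties = n())

# The text aesthetic is not drawn by geom_col(); it is only carried
# along so that ggplotly() can show it in the hover tooltip
static_plot <- ggplot(state_pop,
                      aes(x = state, y = total_pop,
                          text = paste("Counties:", n_counties))) +
  geom_col()

# tooltip chooses which aesthetics appear when hovering over a bar
interactive_plot <- ggplotly(static_plot, tooltip = c("x", "y", "text"))
interactive_plot
```

Anything that can be pasted into the text string can be shown this way, so a variable does not need to be mapped to an axis or a fill to appear in the hover window.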