 So in this tutorial, we're going to look at the ubiquitous box and whisker plot. You'll see them in many journal articles, in many reports, and they really help us to demonstrate the spread of data for a numerical variable. And it does it so by the use of the quartile values. And they also be useful to help us indicate possible statistical outliers. So I'm here in RStudio and what I've decided to do in this little tutorial is just to show you the actual RStudio environment as opposed to the RPUB's document which was already rendered as an HTML file. So remember we'll have the yaml up there, the markup language that tells the page how to render and what should be in it. And we're going to render it to an HTML document. The table of contents will be set to true and we're not going to put any numbers to the section. In the first code chunk here, I'm going to put the libraries that we're going to work with. We're going to work with the tibble library, so if you haven't got that, go to packages here and install that package by adding install and typing in tibble. Same for dt, that's data table. And of course we're all here because we love plotly. Next up, I always just style my HTML file with a bit of CSS code, cascading style sheet code there. And all I'm doing is the h1, the h2, and the h3 headings. Remember the h1 headings, that's the largest type that a web page will just put on your screen as far as the normal text is concerned. h2 will be slightly smaller, h3 even smaller, and it goes down to h6. And then of course just normal text. So I like to put in some hexadecimal codes there for my colors, the three colors for the three text sizes, that's it. And then I also have my logo up there. This is how I put in the logo. I don't put any placeholder text there, so just open, close, square brackets. And then of course this png file is in the same folder as this notebook. So no problem there. So you can read up a little bit about the introduction. If you view this page as a rendered page on my RPUBS website or page there at least, and remember that this file will always be available on GitHub, on my GitHub repository as well. And the links to all of this, of course, down below. So read up there. What we're going to do here is just to generate some random values for three variables. And in case you've never done this, because I always just run through it very quickly, I've put quite a few comments to the code here. So if you want to just read what every line of code does, it's all out there for you. What I like to do is just to use the set.seed function. And I just put an integer there in this instance 1234. That just means if I run this code over and over again, I'm going to get the same random values back. So it all looks the same every time I run it. So first I'm going to create three list objects. I'm going to call the computer variables income, stage and country. Income is going to be from a random variable from a normal distribution. So we're going to use the rnorm function there. And we can see the rnorm function there takes three arguments, 500. I want 500 grand values from this normal distribution with a mean of 10,000 and a standard deviation of 100. But you can see I've wrapped this as a first argument for the rnorm function, comma, the second argument is digits equal 2 because income it's in sense. So I want two decimal places there for the values that I get. Stage is going to be a nominal categorical variable. And I'm going to use the sample function for that. And then I'm going to pass it the string vector as the sample space from which to draw these random samples. So the C function there. And I put the three values early, mid and late in there. So that's my vector there. That's my vector there of strings. And then that's the first argument for sample. The second argument is how many I want. I want 500 and I always have to say replace equals 2 because if I draw one, say for instance I draw late, it's got to be chucked back in the bowl so that it's available to be redrawn at random. And we're going to do exactly the same for country. Again, it's a sample. We see the sample space there of my country variable. And again, 500 and replace is true. Now I'm going to create a tuble. Remember, tuble is just a more modern form of a data frame. It's not going to create factors from my categorical variables and the prints nicer to the screen if you render to HTML. So just use tuble there. I'm going to create three columns, an income column, a stage column and a country column. You see the uppercase letters there just to distinguish them from these three list objects that I created. And I'm actually going to pass those values to these columns. Now remember these columns with uppercase, that's my variable name. And we're going to refer to them when we create our plots. And then I'm going to use the data table function and that's from the DT package. And that just prints a very nice table when you render something as HTML. When you render it for the web, it really has a nice search bar and you can go from ascending to descending order and you can see all the different pages of data. So it's a very nice package just to render spreadsheet type data, data frame or tuble to the screen. So if we run all of that, there we go. What I should probably do is there we go. On the right-hand side there, we can see the viewer and that's really what it's going to look like. It's a very nice, this data table when it is rendered to HTML. So let's create a simple box and whisker plot. So what we want to do is to look at this income variable, this 500 values and create a simple box and whisker plot. And I'm going to store it as a computer variable P1, my first plot. The function is always plot underscore Ly and I'm just going to pass the arguments directly. Remember, I could have just added another trace, but we've just got one year. So let's pass all my arguments as to this function. First of all, type equals box. These are keyword arguments. In other words, they have names. So it doesn't really matter what order you place them. So type equals box. So box plot. On the y-axis, I want income. Always remember the tilde. And I use the uppercase I, which means I'm referring to the tibble, the data frame. So I better tell it that the data is in this df data frame that I created. So if I use the lower case, it would have just referenced the list. And I didn't have to say this comes from this data table or this tibble. And then I'm going to give it a name all income and I'm piping this to the layout function because I want a title and I want something on the x-axis and y-axis. On the x-axis, I want more than one thing. So I'm going to pass those as a list. So the title is nothing. No title there and the zero line is false. So on the zero line, I want them when they're drawn to the screen. And you can see the y-axis is going to be income. And let's print it out and see what happens. It's going to print here in my viewer, a lovely box and whisker plot. So let's have a look at that. We see the overall income. I put all income on the x-axis because it's just drawing it from this name that I gave it there. That's why I didn't put in a title. And then income here on the right-hand side and you can see it's been marked as 7,000, 8,000, 9,000, 10,000, etc. So what is this box and whisker plot all about? In case you didn't know, the box in the center has these three lines. The lower line is the first quartile value and you can see there Q1 and you can see it's at the value there. 9,325.805. That is going to be our Q1 value. The second one here is the median and you can see there the median or the second quartile value and the upper edge of this box, I should say, is the third quartile value. And then you can see these whiskers go up and whiskers go down and what is happening here in plotly is this upper limit and this lower limit will either be the minimum or maximum value if there are no outliers beyond that but if there are outliers beyond that this is going to be one and a half times the interquartile range. Now remember what the interquartile range is? That's the Q3 value minus this Q1 value. You multiply that by 1.5 and you add this to the Q3 level and you subtract that from the Q1 level and then you place these and anything beyond that we see as statistical outliers and you can see they are nicely plotted there for us. If everything fell within those then of course either of these whiskers would just be the maximum or the minimum but there are outliers beyond both so you can see this will be one and a half times the interquartile range above this and one and a half times the interquartile range below and that's why you sometimes see these interquartile ranges these whiskers I should say one longer than the other because you think well it's one time it's some fixed distance away from the upper and lower quartile that just depends on whether this is minimum and maximum or showing the interquartile range itself. Now many times you just want to plot this on its side so nice horizontal plot and all we have to do there's there are more ways to do it but the easiest way to do it is just to change it from the y-axis that we had there so income is on the y-axis now we place income on the x-axis that is it that's all we're going to do and then remember we should just put this income on the x-axis now when we plot that text and if we run this code we see it's just nicely plotted on its side so I've put income on the x-axis here as far as the title is concerned but I just change this to x so the numerical variables now actually on this drawn across the x-axis so that's very simple now sometimes we don't just want these outliers plotted we want all of the data points plotted now this is not going to look so nice because there are 500 of them but the way we can do that is to say box points equals now there's a couple of these arguments we can use I'm just going to say all, I want all of them but if they're all printed on the same line you can imagine that you're not going to see all of them someone will be plotted right on top of each other so we add a little bit of jitter so that they plot next to each other I'm only going to set that to 0.3 and then the point position negative 2 that shifts it to the side of the box let me show you let me show you the outcome there we go, we put this on the y so it's going to be to the left negative 2 to the left you see all of them there, there's a bit of jitter so it prints left to right and you can see all the values they're all 500 of them there what if we want to add a mean and standard deviation because remember these are just the quartiles that we have here I can add this keyword argument box mean and if I set it to SD it'll actually do the mean and the standard deviation if I just say box mean equals 2 it's just going to draw the mean in as well but I always put it to SD so then we have the mean and the standard deviation and it's going to do these dotted lines for you that's what it looks like so you can see the mean and the medians very close to each other there the dotted line is very close to the median line and then there's diamond where it intersects with the whiskers that actually indicates to us the standard deviation great stuff now let's create more than one box plot and seeing that we are dealing with a table or data frame we can just split it up by some categorical variables on the y-axis I still want the income so it's going to be this upright but I'm using the color equals and that's got nothing to do with the actual color red, pink, blue, orange, etc it's the keyword argument for splitting the numerical variable up by the sample space unique data point values in this categorical variable stage so let's have a look at that so remember there were three three unique data point values in the stage in the sample space of the stage variable there was early, late and mid and now you can see the numerical variable is split up along those three that it found so that will be income only for early for data point values that are early in the stage late and then the mid there at the end now we can be even fancier than that we can first split it up by the country now look at that I say x equals country remember there were two there was Canada and US so it's going to split it up by that first and then by the color so for each of the two countries we're going to have all three stages and what we have to do here is in the layout add the box mode and I'm going to say group them please now you might get a warning an error message saying that box mode is not a keyword argument and layout indeed it is and it does work so I've come here and I've disabled show warnings and show messages you can see that their message equals false and warning equals false up there and let's just print it out because indeed it does work no problem there so now we can see it's country and then each of these are early late and mid and always remember with plotly I can turn on and turn off some of the values that I do want or don't want so let's turn that back on let's turn mid off now I only have early and late it is really fantastic I just love plotly now let's just play around a bit what I've got for you here is just to change the outlier marker shape and if you go on the plotly website you'll see there are plenty of shapes you can give so I'm going to say marker I'm going to bring in the keyword argument marker and as a list I've got to pass everything as a list even though I only have one I've got symbol equals square dot square dot let me show you what a square dot looks like so those outliers are just tiny little squares sometimes that looks a bit better because our box is a rectangle anyway let's just change the colors a little bit I'm going to stay with the square dots I'm going to use the full color argument I'm going to set that to a word one of the words that are allowed and it's pink and then the line that's the out this outlier line here or not outlier this outline I should say I'm going to make the color gray and I'm going to make the width pixels so let's have a look at that by just adding the full color and the line keyword arguments and I see this nice pink and I see this this gray outline that we have there of course we can just choose let's just right choose correctly there choose of course we can choose a color set and there I'm going to use the colors keyword argument so remember there's color that is just going to split by the sample space of the categorical variable but colors that's the actual colors and there are a couple of these named sets you can find them on the Plotli website we're going to go for set three and they're definitely set one set two and set three and some others and you can see this nice pale green and this pale yellow and it's split it up by that so really box and whisker charts ideal for your numerical variables at least to show the spread in the data and you can split that up by various categorical variables