It's LinkedIn Learning author Monica Wahee with today's data science makeover. Watch while Monica Wahee demonstrates making box plots in R.
Hi everyone. So today let's talk box plots. What do you need to make a box plot? A continuous variable. And look at what I have here. A continuous variable: staffed beds. So what kind of experimental units can have staffed beds? You got it. Hospitals. Massachusetts hospitals, to be exact. That's the data we are using today, which I copied from the American Hospital Directory, as you can see here. So we're going to use the staffed beds variable from this hospital data for the state of Massachusetts (oh, excuse me, the Commonwealth of Massachusetts) to demonstrate our box plot.
So here's what I did. I made an Excel spreadsheet with two tabs. We are on the first one, which is called MA-Hosp, for obvious reasons. You will see I just copied out two columns: hospital name, which I called hosp name, and staffed beds. Also, you'll see that I removed all the zero-bed records. Those places really do have staffed beds, but there was just a reporting error. That's not very scientific, I know. But this is just a demonstration. Oh, and also, if you do this at home, make sure you designate this column as numeric with no decimal places. You'll see in my live stream recording where I fight with the data types. The life of a data scientist. Box plots can only be made with numeric data types.
Oh, and I made you another data set too that I want to use in my demonstration. I wanted to demonstrate doing one population-level box plot. That's what I'll use this MA-Hosp data for. And I wanted to demonstrate making a comparison with box plots. So that's where this next tab, called city compare, comes in. Let me click on it and show you the data. See what I did? I kept the hosp name and the staffed beds fields, but I added this hosp city field. But I didn't put all the Massachusetts hospitals in there, just the ones from Boston.
And, as you can see, ones from the Minneapolis-St. Paul metropolitan area in Minnesota. I'm from there, so I know the hospitals there. I didn't want to work with unfamiliar data for this demonstration, you see. But also, you'll see I combined Minneapolis and St. Paul together in order to get one level of hosp city. So our box plot will have two boxes, one for Boston and one for the Twin Cities. So to create the data sets on GitHub, which are in CSV format, I did File, Save As and saved each of these tabs as a separate CSV. Let's go to R and read them in.
Jump up, and here we land in the Windows R GUI environment. And what do we have here? Of course, we are importing the CSV called MA hosp. I've already pointed this R session to the directory where I put all these files for this demonstration, so it knows where to get the CSV. You just click on the console, go to File, Change Directory, and select the directory. Let's highlight and read in this data set and then take a look at it. I use Ctrl+R. Okay, that looks familiar.
Let's go to our code and look at the column names. We'll run this colnames command. See that in the console, we literally have two variables, hosp name and staffed beds. So I'm going to show you how easy it is to make a box plot with a variable using base R. See this code: it just says boxplot, then the argument is the variable staffed beds in the data frame MA hosp. I tend to refer to variables this way in R, by the variable names, but you can refer to them by the column number. Anyway, let's highlight and run this and see what base R gives us for a plain, unadorned box plot.
Here we are. Let's notice some features of this box plot. See these whiskers? They appear by default. And also, the box is in a vertical orientation. I'm not sure how it figures out which are outliers. See those circles at the top? I would have to look at the documentation. There are no titles or axis labels, but the box plot itself is gray, which is kind of nice.
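The steps just described (read the CSV, check the column names, draw a plain base R box plot) could be sketched roughly as follows. The file name `MA_hosp.csv`, the data frame name `MA_hosp`, and the column names `hosp_name` and `staffed_beds` are assumptions based on how they are spoken in the video. The rows are made-up placeholder values written to a temporary file so the sketch runs on its own; they are not the real hospital data.

```r
# Made-up placeholder rows standing in for the real spreadsheet export.
# Hospital names and bed counts are NOT real; column names are assumed.
placeholder <- data.frame(
  hosp_name    = paste("Hospital", 1:20),
  staffed_beds = c(25, 30, 41, 52, 58, 66, 74, 80, 88, 95,
                   101, 110, 118, 126, 140, 155, 170, 190, 230, 900)
)

# Write the placeholder to a temporary CSV so read.csv() has a file to read.
csv_path <- file.path(tempdir(), "MA_hosp.csv")  # hypothetical file name
write.csv(placeholder, csv_path, row.names = FALSE)

# Read the CSV into a data frame and take a look at it.
MA_hosp <- read.csv(csv_path)
MA_hosp

# List the column names.
colnames(MA_hosp)

# Box plots need a numeric variable, so check the type before plotting.
is.numeric(MA_hosp$staffed_beds)

# Plain, unadorned base R box plot of a single variable.
# boxplot() also returns, invisibly, the statistics it used for the drawing.
bp <- boxplot(MA_hosp$staffed_beds)

# The values drawn as separate outlier circles.
bp$out
```

On the outlier question: per the `boxplot` help page, the `range` argument (default 1.5) controls this. Whiskers extend to the most extreme point within 1.5 times the interquartile range of the box, and anything beyond that is drawn as a circle.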
Okay, enough admiring this box plot. Let's go back to our code. Now, base R has a lot of commands you can use to change this box plot and make it clearer and more attractive. I'm just giving a few examples here. Notice how this code is modified from the last code. We just add a comma after the variable name, and then we add all these arguments separated by commas, which I actually put on different lines. Here I set the color to gold, I add x and y labels, and I do this notch equals TRUE thing. Let me run it and show you what it does. See that? With a pleasing color of yellow. And the notch equals TRUE gave the box kind of a waistline. So you can go on to deck out a base R box plot.
But a lot of people are R snobs, so they don't like base R. They want everyone to use the ggplot2 graphing package. And I guess I'm one of these snobs, but only when the plot leaves the house and goes out mixing in the public. Plots that are not on public display can be in base R. Base R is good for running a quick and dirty plot while you're in the middle of an analysis, just so you can quickly see the distribution of a variable. But if you're going to publish something, you should really spend your energy on making a gorgeous box plot in ggplot2, which is what I will demonstrate now.
So we start here with the library command to call up the ggplot2 package, but then you'll see on the lines below that it's kind of work to get a plain vanilla box plot out of ggplot2. See how we start with this top line: declare the data frame, declare the aes, which I believe stands for aesthetic. And that's where we put the x. So we tell R in the first line exactly what variable we want to graph. But look what happens when I run just this code before the plus. Oh, and I have to run the library too. Let's run this and see what happens. See this empty graph? This used to totally confuse me about ggplot2.
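A rough sketch of the two things shown so far in this part: the decked-out base R call, and the first ggplot2 line run on its own. The data frame and column names are assumed from how they are spoken, the axis label wording is hypothetical, and the values are made-up placeholders rather than the real hospital data.

```r
library(ggplot2)

# Made-up placeholder values; names and bed counts are NOT real data.
MA_hosp <- data.frame(
  hosp_name    = paste("Hospital", 1:20),
  staffed_beds = c(25, 30, 41, 52, 58, 66, 74, 80, 88, 95,
                   101, 110, 118, 126, 140, 155, 170, 190, 230, 900)
)

# Base R: the same call as the plain box plot, followed by a comma and
# extra arguments, one per line.
boxplot(
  MA_hosp$staffed_beds,
  col   = "gold",                     # fill colour of the box
  xlab  = "Massachusetts hospitals",  # hypothetical label wording
  ylab  = "Staffed beds",             # hypothetical label wording
  notch = TRUE                        # pinches the box in around the median
)

# ggplot2: only the first line, before the plus sign.
# This declares the data frame and the aesthetic mapping, but no shape.
canvas <- ggplot(MA_hosp, aes(x = staffed_beds))

# No layers have been added yet, which is why the plot comes out empty.
length(canvas$layers)

# Printing it draws the axes and background with nothing on them.
canvas
```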
The trick with ggplot2 is that it won't put your variable on the graph unless you tell it a shape, which is what we are going to do in the next line after the plus. See here, that's the trick with ggplot2. We are declaring the shape geom_boxplot. Okay, now let's run both these lines of code and actually look at our plain, ordinary, unaccessorized ggplot2 box plot.
Ick, yuck. I'm sorry, but I just think this is so ugly. First, I do not like the orientation. I want it to be up and down, because that's how we do it in healthcare. So no one can do it any other way. Just kidding. But that's still the way I'm used to it. And I really hate this gray background. And what happened to the whiskers? Base R gives you default whiskers. So as you can see, if you just want a quick box plot, base R is probably your game, not ggplot2.
Let's go back to our code. But actually, it totally doesn't matter that that's ugly, because ggplot2 is literally meant for beautifying plots. You will see the beautification I added here. First, I colored the box plot pink, and then I added axis labels. And I used coord_flip to flip the coordinates so that it is up and down and not sideways. But you will see the x and y labels still refer to the horizontal orientation. Oh, well, you can't win them all. Oh, and on the last line, I call up theme_classic. That's how you know you have a true ggplot2 snob. They call a theme on the last line. This theme is what cleans up that ugly gray and makes my box plot look more Mondrian. Okay, let's run this code and take a look. Now that's what I call a box plot. But still no whiskers, sorry.
So here's your bonus. What's always hard for me is adding the grouping variable. Like, for example, what if you wanted to compare staffed beds in hospitals in two cities, say Boston and the Twin Cities metropolitan area? Well, I'm so glad you asked. I got your data demonstration right here.
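The plain and the beautified single-variable ggplot2 box plots from this part could be sketched as below. The data frame name, column names, and label wording are assumptions, the values are made-up placeholders, and the sketch assumes ggplot2 3.3.0 or later, where mapping a continuous variable to `x` alone gives a sideways box.

```r
library(ggplot2)

# Made-up placeholder values; names and bed counts are NOT real data.
MA_hosp <- data.frame(
  hosp_name    = paste("Hospital", 1:20),
  staffed_beds = c(25, 30, 41, 52, 58, 66, 74, 80, 88, 95,
                   101, 110, 118, 126, 140, 155, 170, 190, 230, 900)
)

# Plain version: the first line declares data and aes, and the line after
# the plus declares the shape, so the box actually gets drawn.
plain <- ggplot(MA_hosp, aes(x = staffed_beds)) +
  geom_boxplot()

# Beautified version, following the steps described in the video.
decked_out <- ggplot(MA_hosp, aes(x = staffed_beds)) +
  geom_boxplot(fill = "pink") +        # colour the box pink
  xlab("Staffed beds") +               # labels the x aesthetic, which
                                       # coord_flip() draws vertically
  ylab("Massachusetts hospitals") +    # hypothetical label wording
  coord_flip() +                       # turn the box up and down
  theme_classic()                      # drop the gray background

plain
decked_out
```

Because `xlab()` stays attached to the variable mapped to `x`, the label text follows the data after the flip even though it now sits on the vertical axis, which matches the "labels still refer to the horizontal orientation" remark.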
So as you can see, we have a read.csv code to read in our data set and look at it. I'll just quickly run this. Okay, no surprises here. You already saw this data.
Okay, so here's our base R box plot code. The main change you will see from the previous code is that now, instead of graphing just a variable, staffed beds, it is graphing an equation. It says staffed beds, and then a tilde, and then hosp city. That tells base R to group the boxes by hosp city. And then under the color option, I had to add another color, because it has to have the same number of colors as levels in your grouping variable. I just went with pink and gold again. I thought they looked pretty enough together. Okay, let's run this code now. Okay, looks good. No surprises here either. We still have our whiskers.
Now let's see what ggplot2 has to offer us. Here we have our ggplot2 code. And the first thing we should notice is that the aes argument on the first line now has a y in it. Before, we only declared x, so we just got one box. Now, by declaring the y as hosp city, we tell ggplot2 what our grouping variable is. And then on the next line, in the geom_boxplot command, is where we tell R that we want pink and gold for the boxes. I just modified the decked-out code because, well, you know, I want it to be pretty. So let's highlight and run the ggplot2 version of a box plot. Gorgeous. Lovely. I mean, isn't it lovely? Really a nice pair of hospital box plots.
Thank you for watching this data science makeover with LinkedIn Learning author Monica Wahee. Remember to check out Monica's data science courses on LinkedIn Learning. Click on the link in the description. Thank you for watching this video. If you thought it was a good use of your time, then please hit the like button. Also, I invite you to look around my channel. And if you like what you see, subscribe. And don't forget to slow down and smell the flowers along the way during your data science journey. Have a nice day.
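For anyone following along after the video, the grouped comparison from the bonus segment could be sketched roughly like this. The data frame name `city_compare`, the column names, and the label wording are assumptions, and the rows are made-up placeholders, not real hospitals. Passing `data =` to `boxplot()` is a choice made for this sketch; the video does not confirm how the formula refers to the data frame. The ggplot2 part assumes version 3.3.0 or later.

```r
library(ggplot2)

# Made-up placeholder rows; hospital names and bed counts are NOT real data.
city_compare <- data.frame(
  hosp_name    = paste("Hospital", 1:12),
  hosp_city    = rep(c("Boston", "Twin Cities"), each = 6),
  staffed_beds = c(120, 250, 310, 400, 520, 700,
                   90, 180, 260, 330, 450, 610)
)

# Base R: the formula staffed_beds ~ hosp_city draws one box per city.
# One colour is supplied per level of the grouping variable.
bp_grouped <- boxplot(
  staffed_beds ~ hosp_city,
  data = city_compare,
  col  = c("pink", "gold"),
  xlab = "City",           # hypothetical label wording
  ylab = "Staffed beds"    # hypothetical label wording
)

# ggplot2: x is still the continuous variable, and y is now the grouping
# variable. The two fill colours are set inside geom_boxplot(); a colour
# vector used this way needs one entry per box.
grouped <- ggplot(city_compare, aes(x = staffed_beds, y = hosp_city)) +
  geom_boxplot(fill = c("pink", "gold")) +
  xlab("Staffed beds") +
  ylab("City") +
  coord_flip() +           # up-and-down boxes, as in the single-box version
  theme_classic()

grouped
```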