Hi everyone. I hope you're having a good day. Today is Wednesday and this is Monica Wahee. I'm starting the chat a little early so you can get in. I'm actually not quite ready, so I'm opening some files here.
Today I wanted to talk to you about box plots and percentiles. I've been doing some work on my YouTube channel, and obviously I'm a data scientist and YouTube has really cool analytics. One of the things I noticed is that people all over the place seem to want to know about box plots and percentiles. They look really simple and cute, but they're actually not that easy to make, and they weren't that easy to teach back when I taught them. So I thought I'd do a little livestream where I could show you different ways not only to make a box plot, but to teach how to make them. If anybody has any questions about interpretation, I'm happy to talk about that too. I wouldn't want to say they're easy to interpret. They're useful, and it takes you a while. You have to see them often, and then you get used to knowing what they mean.
But the main thing I'm here to do is show you different ways you can make box plots depending on what you're doing: whether you're analyzing somebody's data or your own data, trying to teach somebody how to analyze data, trying to teach yourself, or maybe making a blog post explaining what you did. So I'm going to demonstrate a few different things from my experience in box plot land. I'm going to show you a lot in R. I love R and I'm always using R. But I'm also going to show you other things besides R.
So let me give you one more minute so a few more people can join, and then I'll start my little shtick. This livestream is going to have files associated with it. I'll post them on GitHub.
I always like to go back and update the description for my livestreams, because I'm always going to all these websites. That way, if you're watching now and you want the code I'm going to show you, just come back and there'll be a link to my GitHub.
All right, now it's time to start the livestream. My name is Monica Wahee and I'm a data scientist, but I also teach statistics. I say I teach, and I kind of teach a lot of different things. I mean, I'm a consultant, so I just do data science. But it always seems like I'm teaching, even when I'm consulting, because I'm an epidemiologist and I'm always working with physicians and nurses. They don't know how to do it, and as an epidemiologist that's my job, so I'm supposed to know how to do it. I guess they like working with me because I always teach them stuff.
Anyway, the purpose of this livestream is just to show you how to make box plots. If you wonder what one is, you can watch my video on how to make them and what percentiles mean and everything. But first, let me start by sharing my screen and talking about the data set I'm going to use. I actually stole it from my last livestream. From this data set, we're just going to use the staffed beds. This is a Massachusetts data set, and let me remind you where I got it. Actually, I should have done that first. Let me show you this other web page.
Okay, this website is ahd.com, the American Hospital Directory. If you go on here under free national stats, these are statistics that are just on the web. Why? Because they're public. Here I went to Massachusetts. I live in Massachusetts, so I'm used to the hospitals here. And the cities are here in the geography.
So I went and basically turned this page into data, and now I'm going to show you that data. I probably should have explained that first. Here's the data. It's data about hospitals. You have the hospital name, then the city, like I'm in Boston, and then the staffed beds. That's how many beds they actually have staffed and open. If you come to Boston, you'll see some of these hospitals are really big, but they may have a lot of outpatient care going on and may not have staffed beds. Then this is total discharges, patient days, and gross revenue. We're going to be working with the staffed beds. That's what I'm going to use for the box plot. But you're welcome to download this data set and try the box plot with different fields.
All right, so why don't we go to R? I put that data set in R. Okay, let me see if this looks good. That data set is a CSV that I read into R. You can see here, here's my read.csv of the Massachusetts hospital data demo set. And again, I'll put this on GitHub. I'm calling it hosp_a. If you take my LinkedIn Learning courses, you know I always add this "a" when I start with a data set, because I like to increment the suffix as I transform it. So let's look at what's in hosp_a. Yeah, we know what it looks like, right? Here's the name and so on. And here are the column names, the fields of the data set.
Okay, so let's just remember what a box plot is. I probably should have shown you one first. I'll bring up one of the ones we're going to make later. Okay, here's one of the box plots.
See this box plot? It's a box, and this top edge, I guess it's called the bar, is at the 75th percentile of a quantitative variable like staffed beds. The bottom of the box is at the 25th percentile. This middle line is at the median, the 50th percentile. Then you've got the minimum at the bottom and the maximum at the top. And you've got the tails. So it's actually a box-and-whisker plot, and I guess those are whiskers. These are tails and these are whiskers. They're not animals. I don't know why I'm going into this, but anyway, that's what we're trying to make: a box plot.
So the first thing you would probably realize is that if you were going to draw one, you could just draw that box. But the problem is you'd have to know those things I was telling you: the 75th percentile, the 25th percentile, the median, the minimum, and the maximum. If you want to do something fancy with outliers, you can do that as an extra, but you would at least need to know those five things.
All right. So if you're good at R, you think, okay, I can just read this into R, take that staffed beds column, and there's got to be a command in R that will tell me the percentiles. And you're right. There is a command in R, and it's called quantile. But I'll cut to the chase. If you read in that data set and try to run quantile on it, it's not going to work. And I'll tell you why. See this class command? For whatever reason, the staffed beds variable reads in as character. Maybe there's something in there. I looked at it and couldn't figure out why it read in as character. I didn't see any characters in it. But this is not unusual. If you do data science, this is so normal. So I was like, well, I just want to do this demonstration.
Maybe I can just use this as.numeric wrapper and get out of having to do anything about it. If you run as.numeric, you'll see it just lists the values. These are the staffed beds. And you see there's this NA. Well, one of them must have been blank.
Now remember where this data comes from. I think it's quarterly data. The hospitals have to report to Medicare, which is our public insurance for older adults, what's up with them, like how many staffed beds they have. And the American Hospital Directory cleans up some of that data and makes it available online. So who knows what happened. Maybe they didn't submit it or whatever. They probably have staffed beds. You see all these zeros here? Those hospitals probably have staffed beds too, and it's just bad data. But I was just trying to do a demonstration.
Anyway, so we get these NAs. That's what R uses for null. If you're a SAS user, you'd see a period here. So you can't just run quantile on as.numeric of staffed beds. I tried adding na.rm = TRUE, but I think I was having an order of operations problem. SAS users will know what this means. If you do things in the wrong order in SAS, you get things like this, where you thought you removed something but you didn't, because you've got to put it in the right order for the way SAS processes. So I thought, well, if it's acting like SAS, I'm going to have an R answer.
My R answer, since I'm just trying to demonstrate this, was to pull out a vector and clean it up. So this is me pulling out the staffed beds from hosp_a as a vector. I called it bed_vec_a, with the underscore a, of course. When I did that, I still got NAs, but this isn't in the data set anymore. It's a vector. How many things does it have in it? I ran length on it: 72. But we still need to get rid of that NA. So here's the command to get rid of the NA. I made bed_vec_b.
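A minimal sketch of the cleanup steps just described. The column name staff_beds and every value below are invented for illustration, and the stray "n/a" entry is only an assumption about why a column like this might read in as character; the stream does not say what caused it.

```r
# Hypothetical stand-in for the hospital data; names and values are invented.
hosp_a <- data.frame(
  hosp_name  = c("A", "B", "C", "D", "E", "F"),
  staff_beds = c("120", "0", "852", "n/a", "67", "270"),
  stringsAsFactors = FALSE
)

class(hosp_a$staff_beds)   # character, so quantile() will not accept it

# Pull the column out as a numeric vector. The non-numeric entry becomes NA
# (R would print a coercion warning, which is silenced here).
bed_vec_a <- suppressWarnings(as.numeric(hosp_a$staff_beds))
length(bed_vec_a)          # still one value per row, NA included

# Drop the NA to get a clean vector.
bed_vec_b <- bed_vec_a[!is.na(bed_vec_a)]
length(bed_vec_b)

quantile(bed_vec_b)        # minimum, 25th, median, 75th, maximum

# One-step alternative: na.rm is an argument of quantile(), not as.numeric().
one_step <- quantile(suppressWarnings(as.numeric(hosp_a$staff_beds)),
                     na.rm = TRUE)
```

In this sketch the one-step call at the end does run and matches the two-step result, which suggests the trouble on the stream may have come from where na.rm was placed, though that is only a guess.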
So you see how I do this? That's how I don't run out of names. And the length of bed_vec_b is, see, 71. I got rid of it. Now, what was I even doing? I was doing the quantile. This is the quantile command, and now we can run it on that vector. You can see, here's the minimum, here's the 25th percentile, here's the median, here's the 75th, and here's the 100th.
So now is a good time to show you what would happen if you were trying to demonstrate to a class how to draw one. They've got the quantiles, so let's draw a box. I find that a lot of times I'm having to demonstrate how to draw one by hand, even though I'm teaching R, you know what I mean? So I'm going to show you how I do that. I'm going to show you in Paint, because everybody seems to have Paint except for people who have Macs. I don't know why this opens so wide, but let me share my screen with Paint up.
Okay, so I can just draw on this, and I'm not very good at drawing. What I'm going to do is look back at my percentiles. The first thing I'm going to do is get ready to draw something. We know that if we're going to make a box, we're going to need a y axis and an x axis. So this is going to be the y axis right here, and I already know that the bottom is going to be zero. But what am I going to put at the top? Well, I'm going to sneak back and look at my maximum. My maximum is 852, so I'm going to cheat and make the top 900. That's how you do it on the fly.
Okay, so then what do I do? Before I do the rest of the box, I'm going to make a few tick marks here. This one's halfway, so I'll say 450. But it's hard to graph like that, so I'm going to make a few more. Remember, you're just demonstrating. You want it to be right, and you don't want to take all day.
So here I could probably do 400 here and 500. Then 600, 700, 800, and going down, 450, 400, 300, 200, oops, 100. You can keep your hand on Ctrl+Z a lot. And you can also put some notes in here. This was 400, here we have 300, and 800, just to give them an orientation.
All right. So my maximum was 852. But usually, to make sure it doesn't turn into a hot mess, I start drawing with the 75th percentile, which was 270 from our quantiles. So let me go back. Where is 270? That seems really far down there. Okay, so 270 will be down here somewhere. You're going to need to make this stick out a bit. I'm making just a line, and I start with this one. If you're clever, you could make the whole box at once. So that's the top of the box. The bottom of the box is going to be at the 25th percentile, which is 67.5, so that's really far down here. If you go with the box tool, you could have just done this. Well, maybe I'll do it now. Oops, did I do that wrong? Yeah, I just grabbed the box. Here we go. Okay, I guess that's a little easier.
So now we've got the box part. We need to put the median across. Looking over there, it's 145, and that's really close to the bottom, down here. So you can already see it's skewed, right? Those of you who are used to looking at box plots know that this is skewed.
Where were the minimum and maximum? Well, the max was 852 and the minimum was zero. Remember, we had some values that are probably bad. So here's zero. Let's make another line that goes from zero up to 852, which is up here. It's too bad. I could color over this part, but I probably should have put that line in first. Or I could draw the two pieces separately. Maybe that makes it a little clearer. And then the one up top, too. You can draw the whiskers on. And of course, I wouldn't keep this.
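The hand-drawing steps above can also be mimicked in base R graphics. This is only a sketch of the same idea, not what was shown in Paint: it takes the five numbers quoted in the stream and places each piece of the box one at a time. The horizontal positions (0.6, 1.4, and so on) are arbitrary choices made for this illustration.

```r
# Five numbers quoted in the stream for staffed beds.
five <- c(min = 0, q25 = 67.5, median = 145, q75 = 270, max = 852)

# Empty canvas with a y axis from 0 to 900, like the Paint drawing.
plot(NULL, xlim = c(0, 2), ylim = c(0, 900), xaxt = "n",
     xlab = "", ylab = "Staffed beds", main = "Box plot drawn piece by piece")

# The box runs from the 25th to the 75th percentile.
rect(0.6, five["q25"], 1.4, five["q75"])

# The median is a line across the box.
segments(0.6, five["median"], 1.4, five["median"], lwd = 2)

# Whiskers run from the box out to the minimum and the maximum.
segments(1, five["min"], 1, five["q25"])
segments(1, five["q75"], 1, five["max"])

# Whisker lengths make the skew visible: the upper one is far longer.
upper_whisker <- unname(five["max"] - five["q75"])
lower_whisker <- unname(five["q25"] - five["min"])
```

Comparing upper_whisker with lower_whisker gives a quick numeric check of the skew that is visible in the drawing.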
I'll keep it for you as an example of this demonstration, but I wouldn't keep it this way for the class. I would just show them that this is roughly how you draw it. If you've got a class, they're going to have to draw it on something, depending on how you make the assignment. If your assignment is to do it in R or in some other software, then you don't have to draw it.
I will also show you an example of some online software you can use, which I actually use in my class. If you're making them draw it and calculate all these things by hand, you can't really give them an n of 100, right? So I gave them small n's to calculate by hand and make box plots by hand, but I'll show you another thing I use so that they can use more numbers and see bigger box plots with more values.
But for now, let's go back to R. Let's assume that you don't need to do it by hand and you can just teach it in here. Where we left off is we had done our quantile. Since we're doing quantile here, in case you wanted other percentiles or other probabilities: these are the default ones, because they know I'm making a box plot, I guess. This is it without any arguments. If you want other probabilities, like I pretended I wanted 0.95 and 0.05, the 5th percentile and the 95th percentile, you just use c() here, the combine, with 0.05 and 0.95. You can put as many in here as you want, and it'll give you those percentiles. So watch this, I'll run it, and here they are. So even the 95th percentile is pretty high.
Okay, so what have we done so far?
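A short sketch of the quantile() call with custom probabilities described above. The vector is made up for illustration and only stands in for the cleaned staffed-beds vector; the probs argument is the part that matters.

```r
# Made-up values standing in for the cleaned staffed-beds vector.
bed_vec_b <- c(0, 25, 60, 75, 110, 145, 190, 270, 410, 852)

# With no extra arguments, quantile() returns 0%, 25%, 50%, 75%, 100%.
defaults <- quantile(bed_vec_b)
defaults

# Ask for other percentiles by passing a vector of probabilities.
tails <- quantile(bed_vec_b, probs = c(0.05, 0.95))
tails
names(tails)   # the result is labelled by percentile
```

Any number of probabilities between 0 and 1 can go inside c(), so the same call covers deciles or any other cut points.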
Well, first we looked at our data, which was staffed beds, and then we ran the quantiles on it, because we wanted to see where these things were in case we wanted to make a box plot by hand or demonstrate it. But now we don't want to do that by hand, because obviously it's very messy. Let's do it in R.
Okay, so we've got our bed_vec. Base R has a boxplot function, which I'll just run. Okay, there it is. So this is a box plot. You can see it kind of looks like ours, but with how far it goes up here and where the zero is, I don't know, it's a little misleading. You can edit this box plot and put titles and such on it. But I don't, because I'm an R snob. R snobs go, "Oh, are you using base R for that plot? You should be using ggplot2." And there are 101 reasons why, among them that it just makes sexier plots.
But I will make an argument for base R, and I'll tell you why. I often want a quick and dirty box plot like I just showed you, because I'm just trying to figure out the distribution and I'm in a hurry. It's like the base R box plot is the sewing machine that's always out, and ggplot2 is kind of like the serger. You have to get it out, set it up, thread it, and get it going. But it makes things fancy.
We could spend the whole time in ggplot2. Here's me running the library and loading it. It's up to you, but I find it easier to use ggplot2 with data sets than with vectors. So what I did was remake our vector inside a data set. I started with hosp_a and converted staffed beds to a variable called staff_beds_n. And it's still mad about this. How many rows does this have? Well, that's 72, just like the vector had 72.
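A sketch of the quick base R box plot with a title added. The values are made up for illustration. One detail worth seeing here: boxplot() returns what it drew, and by default its whiskers stop at the most extreme value within 1.5 times the box height, with anything farther out drawn as a separate point. That may be part of why the base R picture does not match a hand drawing that runs whiskers all the way to the minimum and maximum.

```r
# Made-up values standing in for the cleaned staffed-beds vector.
bed_vec_b <- c(0, 25, 60, 75, 110, 145, 190, 270, 410, 852)

# Quick look at the distribution; the plot's statistics are returned too.
bp <- boxplot(bed_vec_b,
              main = "Staffed beds (made-up values)",
              ylab = "Staffed beds")

bp$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
bp$out     # values drawn as separate outlier points
```

In this made-up vector the largest value lands in bp$out rather than at the end of the upper whisker.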
Okay, now I'm going to get rid of that whole record with the NA in it. So this is me creating hosp_b and dropping that record. And how do you know I dropped it? Well, let's look at nrow. See, I dropped it my way. And you can also just look at the field here and see.
All right. If you want to get a box plot out of ggplot2 and you're using a data set, you need to declare the data. You need to declare the x under the aes command, or I guess it's a set of commands. And you've got to add a plus and call geom_boxplot. That's how you get the box. So let's run this ggplot. Actually, I've got to run the whole thing, right? Yeah, here we go.
Okay, so you saw me drawing the box plot in a vertical orientation, so I don't feel good when I see them horizontal. I don't know why. In healthcare, if you read journals, most of the time you're going to see them in the vertical orientation, but there's no right way. It's just that at first this freaked me out, and I was like, oh, I don't like it. But that's the most basic version.
Now let's make it pretty. See, I can go crazy with it. Here I'm making it pink and pretty and putting some labels on it. And I do the coordinate flip, right? That way I've got it vertical. And you'll notice my labels look mixed up. The x label is staffed beds, but when you flip it, the coordinates flip. And then I put theme_classic. So now we're going to get something beautiful. Look at that. So much more gorgeous. This is way better than the thing I was drawing, right? I don't know about this command. I'm going to have to look to see how they decide on the outliers, but they're putting the outliers in. It's nicer. When I publish, I definitely deck out these ggplot2 figures. I just make them gorgeous. You know, I'm a snob.
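A sketch of the two ggplot2 versions described above, assuming the ggplot2 package is installed. The data are made up, staff_beds_n is the column name as heard on the stream, and the title text is invented. The last few lines illustrate the outlier rule documented for geom_boxplot(): whiskers reach the most extreme value within 1.5 times the IQR of the box, and values beyond that are plotted individually.

```r
library(ggplot2)  # assumes the ggplot2 package is installed

# Made-up values; staff_beds_n stands in for the numeric staffed-beds column.
hosp_b <- data.frame(
  staff_beds_n = c(0, 25, 60, 75, 110, 145, 190, 270, 410, 852)
)

# Most basic version: declare the data and the x aesthetic, add the geom.
# This draws the box horizontally.
p_basic <- ggplot(data = hosp_b, aes(x = staff_beds_n)) +
  geom_boxplot()

# Dressed-up version: pink fill, labels, flipped to vertical, classic theme.
# The label is still given as x even though it ends up on the vertical axis.
p_pretty <- ggplot(data = hosp_b, aes(x = staff_beds_n)) +
  geom_boxplot(fill = "pink") +
  labs(title = "Staffed beds (made-up values)", x = "Staffed beds") +
  coord_flip() +
  theme_classic()

print(p_pretty)

# Which values fall outside the 1.5 * IQR fences?
q      <- quantile(hosp_b$staff_beds_n, probs = c(0.25, 0.75))
iqr    <- IQR(hosp_b$staff_beds_n)
fences <- c(q[[1]] - 1.5 * iqr, q[[2]] + 1.5 * iqr)
outliers <- hosp_b$staff_beds_n[hosp_b$staff_beds_n < fences[1] |
                                hosp_b$staff_beds_n > fences[2]]
```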
I just spend my whole day on that figure. But you can't do that if you're just looking at distributions. You've got to get your work done. So don't get stuck in ggplot2 with this, all right?
However, it is easy to do. So I'm going to stick with ggplot2 just for a second, because it is so sexy. What if I wanted to compare box plots? Because that's what you're always doing. Well, the problem is there weren't any natural groups in that data set. So I decided to pick out a few cities I happened to know have a bunch of hospitals and say, okay, we'll compare cities. But it's not really a fair comparison, because I included Springfield, and you'll see in the data set that it only has two hospitals, and you really shouldn't be making a box plot out of that.
So I created this vector called keep_cities. I thought, oh, this will be a good little data science-y thing. This is my vector, keep_cities: Boston, Worcester, and Springfield. Yeah, that's really how you pronounce Worcester. Okay. So you know that hosp_b has, what, 71 rows? We're going to use the subset command, and we're going to make it so the hospital city, that's one of the fields, is in keep_cities. Don't you like that? This is my favorite trick. You can do it in SAS, too. SAS has an IN, and you can use it in the data step. That way you can change the vector and change the cities. At first I just put Boston and Worcester, then I added Springfield. I thought, oh, I like these three little boxes that it produces, but I probably shouldn't have added Springfield.
Okay, so we're going to make the subset data set, hosp_c. See how I don't run out of names? And then we'll look at the rows. We only have 15, right? Actually, let's just look at hosp_c. Yeah, here it is, and see, there are only two from Springfield. There are only four from Worcester, too. You probably shouldn't do this, so this is illegal. Don't tell anybody I did this.
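A sketch of the subset-with-%in% trick described above. The rows are made up, and hosp_city and staff_beds_n are assumed column names, since the real field names are not spelled out on the stream.

```r
# Made-up rows; hosp_city and staff_beds_n are assumed column names.
hosp_b <- data.frame(
  hosp_city    = c("Boston", "Boston", "Boston", "Worcester", "Worcester",
                   "Springfield", "Springfield", "Lowell", "Salem"),
  staff_beds_n = c(804, 410, 95, 300, 120, 735, 60, 150, 200)
)

# Change this vector to change which cities are kept.
keep_cities <- c("Boston", "Worcester", "Springfield")

# Keep only the rows whose city appears in the vector.
hosp_c <- subset(hosp_b, hosp_city %in% keep_cities)

nrow(hosp_c)             # how many rows survived the filter
table(hosp_c$hosp_city)  # rows per city; tiny groups make weak box plots
```

Running table() on the grouping field before plotting is a cheap way to spot groups that are too small to summarize with a box.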
All right, so the reason I did this illegal thing was mainly to show you the sexy ggplot2, and that it can get you in so much trouble. The main thing I changed here, besides making hosp_c the data, is that I added a y, which is the hospital city. Now, these happen to be written out here. Often cities will just have some code, and then the code would come out and be ugly, so that's a whole other thing you have to deal with in ggplot2. And then I wanted it colorful, and I knew it would find three cities, so I put three colors in here. That's the main difference between this one and the previous one.
So let's run it. Isn't that gorgeous? See, there's Boston, Springfield, Worcester, and staffed beds. That doesn't even seem right, does it? Because I think Boston should have, well, you know, it was the pandemic, and I don't even know what's going on anymore. We had field hospitals and everything. I don't even know if they're in there. Who knows what's going on? The healthcare system crashed. Sorry, R is still running, but the healthcare system crashed.
I thought I'd better look at hosp_c. So I decided to write it out as a CSV here, and I did that. Now let's go look at that data set. I'm going to bring up hosp_c from what I exported. Okay, let me share it on my screen with you. It's a CSV, so it comes up a little ugly. We have our staffed beds at the end here. Now, as you can see, Springfield is here: 735. That's why Springfield was way up there. But Boston, I don't know, everything is different. The data change, right? So I'm going to highlight this, go to Data, Sort, and sort it by the hospital city. That way we get all the Boston staffed beds together up here. I guess I can use this one, the Boston one right here, because I wanted to show you something else.
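A sketch of the grouped box plot and the CSV export described above, assuming the ggplot2 package is installed. The rows, the column names hosp_city and staff_beds_n, the three color names, and the output file name are all invented for illustration. Using scale_fill_manual() for the colors is a choice made for this sketch; the stream does not show how the three colors were supplied.

```r
library(ggplot2)  # assumes the ggplot2 package is installed

# Made-up rows; column names are assumptions.
hosp_c <- data.frame(
  hosp_city    = c("Boston", "Boston", "Boston", "Boston",
                   "Worcester", "Worcester", "Worcester",
                   "Springfield", "Springfield"),
  staff_beds_n = c(804, 410, 95, 250, 300, 120, 60, 735, 60)
)

# Adding y = hosp_city gives one box per city; fill colors them by city.
p_cities <- ggplot(data = hosp_c,
                   aes(x = staff_beds_n, y = hosp_city, fill = hosp_city)) +
  geom_boxplot() +
  scale_fill_manual(values = c("pink", "lightblue", "lightgreen")) +
  labs(x = "Staffed beds", y = "City") +
  coord_flip() +
  theme_classic()

print(p_cities)

# Write the subset out so it can be opened in a spreadsheet.
out_file <- file.path(tempdir(), "hosp_c.csv")
write.csv(hosp_c, out_file, row.names = FALSE)
```

The manual fill scale needs at least as many colors as there are cities, so the colors have to be updated whenever the city list changes.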
And that is that I actually gave this data set to my class, who were doing things by hand. They weren't using R. But I wanted to show them how to deal with a box plot with a lot of values in it that they made themselves. So I'm going to copy these values and paste them into this online calculator. I wouldn't use it for a real-life journal article, but it's great to use with students. It's on this Alcula website. I'm going to stop sharing and we'll go to that. They seem to have other calculators, but this is the only one I've found a use for.
So this is Alcula, and this thing has been here a long time. If you come here, you find this, and you can choose sample or population, and you'd have to read their docs as to what they're doing exactly. Also, even when I was teaching how to make these by hand, I found out there are two different ways to calculate the 75th and 25th percentiles by hand. So I had to deal with that and tell the students I was doing it one certain way. This is a lot more complicated than it looks.
Okay, so I'm going to enter this data. I just pasted it. It says to enter comma-separated data, but let's see if it'll handle it if it's not comma separated. Yeah, it did. Okay, see that? It says the population size is nine. This is just the Boston one. And let's see here, the maximum was 804. That must have been the maximum in here. Yeah, it sure was. And it put the commas in there.
One thing you could do if you make the class do this: what I did was have them each take a systematic sample and make the box plot. Then everybody did it, and I said, we did a simulation. Let's look at the population box plot for the whole of Massachusetts and see how your different samples compare to it. It was kind of a fun little thing, but they had a little trouble saving it.
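The point about there being more than one way to calculate the 25th and 75th percentiles by hand can be seen in R, where quantile() takes a type argument that selects among several conventions. This sketch uses nine made-up values; it does not claim to reproduce the method used in class or by the online calculator, which the stream does not specify.

```r
# Made-up sample of nine values, already sorted.
x <- c(2, 4, 4, 5, 7, 8, 10, 12, 15)

# R's default convention (type 7).
q_default <- quantile(x, probs = c(0.25, 0.75))

# A common hand method: the percentile sits at position p * (n + 1) (type 6).
q_hand <- quantile(x, probs = c(0.25, 0.75), type = 6)

# Tukey's hinges, which base R's boxplot() uses for the box edges.
hinges <- fivenum(x)[c(2, 4)]

q_default
q_hand
hinges
```

With these values the 75th percentile differs between the two conventions, which is the kind of mismatch students run into when comparing hand calculations against software.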
You have to right-click and choose "Save image as," which I did for you. So I'll put this image on my GitHub. And the link to the calculator is in the description. So let me go back here.
All right. So I just went over a whole bunch of stuff really quickly. First, I went over what you need to make a box plot by hand. The answer is at least the minimum, the maximum, and the 25th, 50th, and 75th percentiles. Then I said, okay, how do you demonstrate drawing one? And I showed you using Paint to draw one. Then I showed you two different ways of using R to make a box plot. And finally, I showed you this calculator that you can use online to make a box plot.
I haven't gotten any questions yet. Does anybody have any questions about box plots? I guess with all this talk, why would you even make one? Well, the main reason is you want to see the distribution of your data. If you think about staffed beds, you saw all those zeros in our data. So our data were, like, left skewed. I always think light on the left is left skewed. So they were left skewed. Or maybe they were right skewed. I don't know. I can't remember. Let's look at the box plot. Actually, they were pretty normal now.
But I've noticed that one of the things that's always skewed is length of stay. If you ever get hospital lengths of stay, for every single person who gets checked in and checked out, most of the time it's one or two days. But, you know, with something like COVID, every once in a while someone's there a long time. So that's always skewed data. Now, if you've got normally distributed data, where the median and the mean are on top of each other, you can do certain statistics that you can't do if the data are skewed. So what are the strategies? You can use different statistics, but you can also often take a log of the skewed data.
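A sketch of the log strategy for skewed data, using made-up lengths of stay in days. The helper skew_gap() is a hypothetical name introduced only for this illustration: it measures how far the mean sits from the median, in units of the standard deviation. Note that log() needs positive values, so a variable with zeros, like the staffed beds on the stream, would need different handling.

```r
# Made-up lengths of stay in days: mostly short, a few very long.
los <- c(1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 7, 10, 21, 45, 90)

# Hypothetical helper: gap between mean and median, scaled by the spread.
skew_gap <- function(v) (mean(v) - median(v)) / sd(v)

log_los <- log(los)   # natural log; every value must be greater than 0

gap_raw <- skew_gap(los)
gap_log <- skew_gap(log_los)

# Side-by-side box plots, before and after the transformation.
op <- par(mfrow = c(1, 2))
boxplot(los, main = "Length of stay", ylab = "Days")
boxplot(log_los, main = "Log of length of stay", ylab = "log(days)")
par(op)
```

In this made-up example the gap shrinks after the log but does not vanish, so the box plot is still worth checking after the transformation rather than assuming the result is normal.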
You can take a log of each of the values, and then you might get a normal distribution. But what you're doing is making box plots the whole time to help you see what the distribution looks like.
So this was kind of short. I just wanted to check in and show you a few things. I'll put all the links in the description, and I'll put up a GitHub so you can go get the code. And if you're here, thank you for joining me. Everybody's very quiet today. Nobody's got any questions. I really appreciate you coming to my channel. Even if you're not here to make a box plot, it'd be great if you would look around and see if there's anything you like. Please like it or subscribe, or both. I'm trying to put up more stuff, so let's see how I do. Well, thanks for joining me on Wednesday afternoon, and I hope you have a good rest of your week.